# What's So Bad About Ad-Hoc Mathematical Definitions?

post by johnswentworth · 2021-03-15T21:51:53.242Z · LW · GW · 52 comments## Contents

Proxies vs Definitions You Already Know The Answer … But Have You Fully Specified The Answer? Recap None 52 comments

Suppose it’s the early twentieth century, and we’re trying to quantify the concept of “information”. Specifically, we want to measure “how much information” one variable contains about another - for instance, how much information a noisy measurement of the temperature of an engine contains about the actual engine temperature.

Along comes __Karl Pearson__, and suggests using his “correlation coefficient” (specifically the square of the correlation coefficient, . As a measure of information, this has some sensible properties:

- If there’s no information, then is zero.
- If is one, then there’s perfect information - one variable tells us everything there is to know about the other.
- It’s symmetric: the amount of information which X tells us about Y equals the amount of information which Y tells us about X.

As an added bonus, it’s mathematically simple to calculate, estimate, and manipulate. Sure, it’s not very “principled”, but it seems like a good-enough measure to work with.

Now an engineer from Bell Telephone shows up with a real-world problem: they’ve been contracted to create secure communications for the military. They want to ensure that externally-visible data Y contains no information about secret message X, so they need a way to measure “how much information” one variable contains about another. What a perfect use-case! We advise them to design their system so that X and Y have zero correlation.

A few years later, Bell Telephone gets a visit from a very unhappy colonel. Apparently the enemy has been reading their messages. Zero correlation was not enough to keep the secret messages secret.

Now, Bell *could* patch over this problem. For instance, they could pick a bunch of functions like , , , etc, and require that those *also* be uncorrelated. With enough functions, and a wide enough variety, that might be enough… but it’s going to get very complicated very quickly, with all these new design constraints piling up.

Fortunately, off in a corner of Bell Labs, one of their researchers already has an alternative solution. Claude Shannon suggests quantifying “how much information” X contains about Y using his “mutual information” metric . This has a bunch of sensible properties, but the main argument is that is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y. It’s the number of bits “savable” by knowing Y. By imagining different things as the “message” and thinking about how hard it is to *guess* X after knowing Y, we can intuitively predict that this metric will apply to lots of different situations, including Bell’s secret message problem.

Shannon advises the engineers to design their system so that X and Y have zero mutual information. And now, the enemy can’t read their messages quite so easily.

## Proxies vs Definitions

In this story, what does the correlation coefficient do “wrong” which mutual information does “right”? What’s the generalizable lesson here?

The immediate difference is that correlation is a *proxy* for amount of information, while mutual information is a true definition/metric. When we apply optimization pressure to a proxy, it breaks down - that’s __Goodheart’s Law__ [? · GW]. In this case, the optimization pressure is a literal adversary trying to read our secret messages. The optimizer finds the corner cases where our proxy no longer perfectly captures our intuitive idea of “no information”, and they’re able to extract information about our secret messages. Correlation doesn’t capture our intuitive notion of “information which X contains about Y” well enough for zero correlation to prevent our adversaries from reading our messages.

Mutual information, by contrast, handles the optimization pressure just fine. We intuitively expect that "Y contains zero information about X" is enough to keep our messages secret, even in the presence of adversaries, and the mutual information definition of "information" is indeed enough to match that intuitive expectation.

So… that’s all well and good. We want definitions/metrics which are robust to optimization pressure, rather than proxies which break down. But how do we find robust definitions/metrics in the first place? In the long run, of course, we can try out a metric on lots of different problems, prove lots of different theorems about it, and get an idea of robustness that way. But there are infinitely many possible metrics for any given concept; we don’t have time to go through that whole process for all of them. How do we figure out *in advance* what the robust concept definitions are?

## You Already Know The Answer

A classic quote from famed physicist John Archibald Wheeler: “Never make a calculation until you know the answer”.

In math, it’s very easy to write down some expressions or equations or definitions, and start pushing symbols around, without having any idea what the answer looks like or how to get there. In undergrad math classes, this often works, because the problem is set up so that there’s only a handful of things which you can do at all. In research, we don’t have that guardrail, and we *especially* don’t have that guardrail when finding the right definitions is part of the problem. I have literally spent months pushing symbols around without getting anywhere at all. __Math is a high-dimensional space; brute force search does not work__ [LW · GW].

Bottom line: if we want to get anywhere, we need to already have at least *some* intuition for what we’re looking for, and we need that intuition to guide the search. “Never make a calculation until you know the answer” is the sort of lesson which gets beaten in by months or years of failure to follow it.

Fortunately, we already have a *lot* of intuition to lean on, even without years of mathematical study. For instance, if we look back at the information example from earlier… what are the intuitive arguments for why correlation seems like a reasonable measure of information?

- If there’s no information, then is zero.
- If is one, then there’s perfect information - one variable tells us everything there is to know about the other.

These seemed pretty natural, right? This is exactly what “knowing the answer” looks like - we have some intuition about what properties a measure of “information” should have. In the case of mutual information, the intuition was this argument:

I(X; Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y. It’s the number of bits “savable” by knowing Y. By imagining different things as the “message” and thinking about how hard it is to

guessX after knowing Y, we can intuitively guess that this metric will apply to lots of different situations…

These are the kinds of intuitions which guide our search in the high-dimensional space of mathematical definitions/metrics.

Note that the engineers’ idea that “data Y contains no information about secret message X” should be sufficient to prevent adversaries from reading the messages is *also* an intuitive property of information. Assuming our intuitions about information are correct (or at least approximately correct), a definition which fully captures our intuitive idea of information should imply this property. If it doesn’t, then either (a) our definition does not fully capture our intuitive idea of information, or (b) our intuition is wrong (in which case we should be able to translate the math back into an intuitive example of how our previous intuition failed).

## … But Have You Fully Specified The Answer?

So, math is high-dimensional, we need intuitions to guide our search. But both the correlation coefficient and mutual information have some intuitive arguments for why they’re good measures of information. What’s the difference? What makes one better than the other?

Let’s go back to the two intuitive arguments for the correlation coefficient:

- If there’s no information, then is zero.
- If is one, then there’s perfect information - one variable tells us everything there is to know about the other.

Key thing to notice: is not the only metric which satisfies these two criteria. For instance, we could exponentiate X and Y and *then* take the correlation, , and both properties still apply. Same with . There’s lots of degrees of freedom here; these two intuitive arguments are not enough to uniquely specify the correlation coefficient as our definition/metric.

By contrast, consider Shannon’s argument:

I(X; Y) is exactly the difference between the average number of bits one needs to send in a message in order to communicate the value of X, and the average number of bits one needs to send to communicate X if the receiving party already knows Y.

This has zero degrees of freedom. This argument (with a couple approximations) is enough to uniquely specify Shannon’s formula for mutual information.

Adam Shimi __gave a great analogy for this__ [LW(p) · GW(p)]: the intuitive arguments are like a set of equations, and the definition/metric is like a solution. Ideally, we want the “equations” to nail down one unique “solution”. If that’s the case, then there’s only one definition compatible with our intuitive arguments. If we intuitively expect some additional properties to hold (e.g. “no information” being sufficient to prevent adversaries from reading our secret messages), then either they have to hold for that one definition, or our intuition is wrong.

On the other hand, if our “equations” have multiple “solutions”, then it’s kind of misleading to pick out one solution and declare that to be our answer. Why that solution? If there’s lots of different definitions/metrics which satisfy the intuitive arguments for correlation, then why not use one of the others? More to the point: how do we know our intuition itself isn’t built around some other metric which satisfies the properties? We believe our intuitive concept satisfies the listed properties, and we believe our intuitive concept satisfies some more general properties as well (e.g. “no information” protecting secret messages”), but that does not mean that *any* random definition compatible with the listed properties is sufficient to imply the more general properties. If we want our intuition to apply, then we need to find the definition/metric which actually corresponds to our intuitive concept (assuming such a definition/metric exists), not just some proxy which satisfies a few of the same properties.

## Recap

We want mathematical definitions/metrics which are robust - in particular, they should not break down when we apply optimization pressure. In the long run, we can verify robustness by using a definition/metric in lots of different problems and proving theorems about it. But math-space is high dimensional, so we need a more efficient way to search for good definitions/metrics.

One main way we do this is to lean on intuitions. We already have intuitive concepts, and we have some beliefs about the properties those concepts should have. If we can accurately translate our intuitive concepts into mathematical definitions/metrics, then they should satisfy the intuitively-expected properties. (Or else our intuitions are wrong, and a good definition/metric should convince us of that when the definition doesn’t satisfy an expected property.)

The key challenge here is to come up with a set of intuitive arguments which *uniquely *specify a particular definition/metric, exactly like a set of equations can uniquely specify a solution. If our arguments have “many solutions”, then there’s little reason to expect that the ad-hoc “solution” we chose actually corresponds to our intuitive concept. If our chosen definition/metric does not correspond to our intuitive concept, then even if our intuition is correct, it shouldn’t be too surprising if the definition/metric fails to have more general properties which we intuitively expect.

In short: if our arguments are not sufficient to uniquely nail down one definition/metric, then we lose our main reason to expect the definition/metric to be robust.

*Thankyou to Adam Shimi for a conversation which led to this post.*

## 52 comments

Comments sorted by top scores.

## comment by Donald Hobson (donald-hobson) · 2021-03-16T17:08:12.034Z · LW(p) · GW(p)

Shannon mutual information doesn't really capture my intuitions either. Take a random number X, and a cryptographically strong hash function. Calculate hash(X) and hash(X+1).

Now these variables share lots of mutual information. But if I just delete X, there is no way an agent with limited compute can find or exploit the link. I think mutual information gives false positives, where Pearson info gave false negatives.

So Pearson Correlation => Actual info => Shannon mutual info.

So one potential lesson is to keep track of which direction your formalisms deviate from reality in. Are they intended to have no false positives, or no false negatives. Some mathematical approximations, like polynomial time = runnable in practice, fail in both directions but are still useful when not being goodhearted too much.

Replies from: johnswentworth, ForensicOceanography## ↑ comment by johnswentworth · 2021-03-16T17:28:00.433Z · LW(p) · GW(p)

This is particularly relevant to the secret messages example, since we do in fact use computational-difficulty-based tricks for sending secret messages these days.

## ↑ comment by ForensicOceanography · 2021-03-18T18:17:16.970Z · LW(p) · GW(p)

Actually the mutual information has some well-defined operational meaning. For example, the maximum rate at which we can transmit a signal through a noisy channel is given by the mutual information between the input and the output of the channel. So it depends on which task you are interested in.

Replies from: donald-hobson## ↑ comment by Donald Hobson (donald-hobson) · 2021-03-19T16:55:38.974Z · LW(p) · GW(p)

A "channel" that hashes the input has perfect mutual info, but is still fairly useless to transmit messages. The point about mutual info is its the maximum, given unlimited compute. It serves as an upper bound that isn't always achievable in practice. If you restrict to channels that just add noise, then yeh, mutual info is the stuff.

Replies from: ForensicOceanography## ↑ comment by ForensicOceanography · 2021-03-22T12:43:44.200Z · LW(p) · GW(p)

Yes, it is the relevant quantity in the limit of infinite number of uses of the channel. If you can use it just one time, it does not tell you much.

## comment by Zack_M_Davis · 2021-03-16T06:58:06.088Z · LW(p) · GW(p)

This is related to something I never quite figured out in my cognitive-function-of-categorization quest. How do we quantify how good a category is at "carving reality at the joints"?

Your first guess would be "mutual information between the category-label and the features you care about" (as suggested in the Job parable in April 2019's "Where to Draw the Boundaries?" [LW · GW]), but that actually turns out to be wrong, because information theory has no way to give you "partial credit" for getting *close* to the right answer, which we want. Learning whether a number between 1 and 10 inclusive is even or odd gives you the same amount of information (1 bit) as learning whether it's over or under 5½, but if you need to make a decision whose goodness depends continuously on the magnitude of the number, then the high/low category system is useful and the even/odd system is not: we care about putting probability-mass "close" to the right answer, not just assigning more probability to the *exact* answer.

In January 2021's "Unnatural Categories Are Optimized for Deception" [LW · GW], I ended up going with "minimize expected squared error (given some metric on the space of features you care about)", which seems to work, but I didn't have a *principled* justification for that choice, other than it solving my partial-credit problem and it being traditional. (Why not the absolute error? Why not exponentiate this feature and then, *&c.*?)

Another possibility might have been to do something with the Wasserstein metric, which reportedly fixes the problem of information theory not being able to award "partial credit". (The logarithmic score is the special case of the Kullback–Leibler divergence when the first distribution assigns Probability One to the actual answer, so if there's some sense in which Wasserstein generalizes Kullback–Leibler for partial credit, then maybe that's what I want.)

My intuition doesn't seem adequate to determine which (or something else) formalization captures the true nature of category-goodness, to which other ideas are a mere proxy.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-16T16:27:09.077Z · LW(p) · GW(p)

(This is a good example. I'm now going to go on a tangent mostly unrelated to the post.)

I think you were on the right track with mutual information. They key insight here is not an insight about what metric to use, it's an insight about the structure of the world and our information about the world.

Let's use this example:

Learning whether a number between 1 and 10 inclusive is even or odd gives you the same amount of information (1 bit) as learning whether it's over or under 5½, but if you need to make a decision whose goodness depends continuously on the magnitude of the number, then the high/low category system is useful and the even/odd system is not...

*Why* do we care about how big this number is, as opposed to even/odd? Let's make the example a bit more concrete: we have a building whose walls are built from bricks, and the number of interest is how-many-bricks-tall the walls are. (Or, if you want to be *really* concrete, assume the wall is made of concrete blocks rather than bricks.)

Key thing to notice: in general, a lot more things are going to depend on the rough height of the wall than on the parity of the bricks, especially things far away from the wall itself (i.e. not just the motions of air molecules right next to the wall). It's the rough height (i.e. the most-significant-bits) which is relevant to things like whether a tall person will hit their head on the ceiling, whether a bookshelf will fit, whether the building casts a shadow on the neighbor's yard, whether I can see the building from somewhere far away, etc. By contrast, brick-parity is much less relevant to things elsewhere. If a wall is 457 inches tall, then the "4" in the hundreds place gives us more information about more other things in the world than the "7" in the ones place.

Generalizing the idea: it's not that we care directly about how-many-bricks-tall a wall is. That is not a terminal value. If we care more about the rough wall-height than about brick-parity, that's because the rough wall-height is more relevant to the other things which we care about in the world. And that, in turn, is because the rough wall-height is more relevant to more things *in general*. Information about brick-parity just doesn't propagate very far in the causal graph of the world; it's quickly wiped out by noise in other variables. Rough wall-height propagates further.

## ↑ comment by Donald Hobson (donald-hobson) · 2021-03-16T18:06:49.985Z · LW(p) · GW(p)

No, its not just about the information, its about information, our utility function, and our epistemic capabilities. Suppose I had taken ultra high resolution electron microscope images of one particular brick in the wall. And burried the hard drives on the moon. Most of the information about the wall that isn't near the wall is the hard drives. But if you are trying to reach the top, and want to know how big a ladder to get, you still don't care about my electron microscope images.

Humans don't track the entire causal graph. We just track the fragments that are most important to achieving our utility function, given our mental limitations. A superintelligent AI might be able to track consequences of brick parity all over the place. All we know is that we can't track it very far. If we are too far from the wall to see the brick parity, we can't track it.

Information about brick-parity just doesn't propagate very far in the causal graph of the world; it's quickly wiped out by noise in other variables.

How do you distinguish the info not being there, from you being unable to see it? A function can be perfectly deterministic, but seem random to you because you can't compute it.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-16T18:51:29.109Z · LW(p) · GW(p)

The problem with the hard-drive example is that the information is only on that one hard drive, buried somewhere on the moon. It's not about *how much* information is relevant far away, it's about *how many* different far-away places the information is relevant. Information which is relevant to many different neighborhoods of far-away variables is more likely to be relevant to something humans care about (because it's relevant to many things); information which is relevant to only a few far-away chunks of variables is less likely to touch anything humans care about.

What makes wall-height interesting is that it's relevant to a lot of different variables in the world - or, equivalently, we can learn something about the wall-height by observing many different things from many different places. If I'm standing on the lawn next door, look down and see the building's shadow, then I've gained info about the building height. If I'm looking at the block from far away, and see the building over the surrounding buildings, I've learned something about the height. If I'm moving a couch around inside the building, and find that I have enough space to stand the couch on its end, then I've learned something about the height.

To put it differently: I can learn about the height from many different vantage points.

A toy model I use to study this sort of thing: we have a sparse causal network of normal variables. Pick one neighborhood of variables in this network, and calculate what it tells you about the variables in some other neighborhood elsewhere in the network. The main empirical result is that, if we fix one neighborhood X and ask what information we can gain about X by examining many different neighborhoods , then it turns out that most of the neighborhoods Y contain approximately-the-same information about X. (Specifically: we can apply a singular vector decomposition to the covariance matrix of X with each of the Y's, and it turns out that it's usually low-rank and that the X-side singular vectors are approximately the same for a wide variety of Y's.) I'll have a post on this at some point.

In the hard drive example, the information is only in one little chunk of the world. (Well, two little chunks: the hard drive and the original brick.) By contrast, information about the wall height is contained in a wide(r) variety of other variables in other places.

How do you distinguish the info not being there, from you being unable to see it? A function can be perfectly deterministic, but seem random to you because you can't compute it.

Well, at least in the toy models, I have can calculate exactly what information is available, and I do expect the key assumptions of these toy models to carry over to the real world. More generally, for chaotic systems (including e.g. motions of air molecules), we know that information is quickly wiped out given any uncertainty at all in the initial conditions.

If my only evidence were "it looks random", then yes, I'd agree that's weak evidence. Things we don't understand look random, not mysterious. But we do have theory backing up the idea that information is quickly wiped out in the real world, given even very small uncertainty in initial conditions.

## comment by Chris_Leong · 2021-03-16T22:53:49.806Z · LW(p) · GW(p)

I really appreciate this post.

I've thought for a long time that people on this website have a bias to just throw down some maths without asking about the philosophical assumptions behind it. And then once they've produced something beautiful, loss aversion makes them reluctant to question it too deeply.

(This is why I've focused my attention on Agent Meta-Foundations [LW · GW]).

## comment by CTVKenney · 2021-03-16T01:12:59.584Z · LW(p) · GW(p)

The VARIANCE of a random variable seems like one of those ad hoc metrics. I would be *very* happy for someone to come along and explain why I'm wrong on this. If you want to measure, as Wikipedia says, "how far a set of numbers is spread out from their average value," why use E[ (X - mean)^2 ] instead of E[ |X - mean| ], or more generally E[ |X - mean|^p ]? The best answer I know of is that E[ (X - mean)^2 ] is easier to calculate than those other ones.

## ↑ comment by AlexMennen · 2021-03-16T05:03:03.549Z · LW(p) · GW(p)

Variance has more motivation than just that it's a measure of how spread out the distribution is. Variance has the property that if two random variables are independent, then the variance of their sum is the sum of their variances. By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness). Since it is the variance of the distributions you were adding together that determines this, variance is exactly the thing you care about if you want to know the degree of spreadout-ness of a sum of a large number of independent variables from the distribution. If you take any measure of how spread out a distribution is that doesn't carry the same information as the variance, then it will fail to predict how spread out the sum of a large number of independent copies of the distribution is, by any measure.

Edit: On the subject of other possible measures of features of probability distributions, one could also make the same complaint about mean as a measure of the middle of a distribution, when there are possible alternatives like median. Again, a similar sort of argument can be used to identify mean as the best one in some circumstances. But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way), so this measures the minimal average distance every point in the distribution has to travel in order for them to all meet at one point (the median is the most efficient point for them to meet).

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-16T15:57:16.500Z · LW(p) · GW(p)

Good point about the central limit theorem. Two nitpicks, though.

By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness)

The "or any other measure of spreadout-ness" can be dropped here; viewing the normal distribution through the lens of either the principle of maximum entropy or sufficient statistics tells us that it is variance specifically which is relevant, and any spread-metric not isomorphic to variance will be a leaky abstraction. (Leaky meaning that it will not capture all the relevant information about the spread, whereas variance does capture all the information, in a formal sense: it's a sufficient statistic.)

But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way)...

I don't think this is right. Suppose I have a uniform distribution over a finite set of X-values. The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant, but the median would stay the same.

I think the measure which would produce median is E[1 - 2 I[X>m]], where I[.] is an indicator function?

Replies from: AlexMennen## ↑ comment by AlexMennen · 2021-03-16T18:29:20.764Z · LW(p) · GW(p)

The "or any other measure of spreadout-ness" can be dropped here

What I meant is that, if you restrict attention to normal distributions with a fixed mean, then any reasonable measure of how spread out it is (including any of the E[|x-mean|^p]) will be a sufficient statistic, because any such measure, in order to be reasonable, must increase as variance increases (for normal distributions), so this function can be inverted to recover the variance. In other words, any other such measure will indeed be isomorphic to variance when restricted to normal distributions.

The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant

This does not change the minimizer of E[|X-m|] because it increases E[|X-m|] by the same amount for every m>min(X).

In general, you can't decrease E[|X-m|] by moving m from median to median-d for d>0 because, for xmedian (half the distribution), you increase |X-m| by d, and for the other half, you decrease |X-m| by at most d.

Replies from: Simuon, johnswentworth, Simuon## ↑ comment by Simuon · 2021-03-17T16:57:03.039Z · LW(p) · GW(p)

I don't agree with the argument on the variance :

"Any other such measure will indeed be isomorphic to variance when restricted to normal distributions."

It's true, but you should not restrict to normal distributions in this context. It is possible to find some distributions X1 and X2 with different variances but same value E(|x-mean|^p) for . Then X1 and X2 looks the same to this p-variance, but their normalized sample average will converge to different normal distributions. Hence variance is indeed the right and only measure of spreadout-ness to consider when applying the central limit theorem.

Replies from: AlexMennen## ↑ comment by AlexMennen · 2021-03-18T00:25:13.189Z · LW(p) · GW(p)

That's exactly what I was trying to say, not a disagreement with it. The only step where I claimed all reasonable ways of measuring spreadout-ness agree was on the result you get after summing up a large number of iid random variables, not the random variables that were being summed up.

## ↑ comment by johnswentworth · 2021-03-16T18:56:53.304Z · LW(p) · GW(p)

Ah, these make sense. Thanks.

## ↑ comment by Simuon · 2021-03-17T16:51:48.789Z · LW(p) · GW(p)

I don't agree with the argument on the variance :

"Any other such measure will indeed be isomorphic to variance when restricted to normal distributions."

It's true, but you should not restrict to normal distributions in this context. It is possible to find some distributions X1 and X2 with different variances but same value E(|x-mean|^p) for $p\neq 2$. When applying the central limit theorem to samples of X1 and X2, only the variance will tell you which of the normal distributions to consider. Variance is indeed the right and only measure of spreadout-ness to consider.

## ↑ comment by Adele Lopez (adele-lopez-1) · 2021-03-16T19:21:25.185Z · LW(p) · GW(p)

Maybe entropic uncertainty (conjectured by Everett as part of his "Many Worlds" thesis, and proved by Hirschmann and Beckner) is along the lines of what you're looking for. It's a generalization of the Heisenberg uncertainty principle that applies even when the variance isn't well defined.

## comment by AllAmericanBreakfast · 2021-03-15T23:44:13.131Z · LW(p) · GW(p)

Pushing symbols around (or the metaphorical equivalent) might be necessary for building intuitions. Incidentally, they call novice chess players *woodpushers*.

When I teach piano improv, I show my students a way of making decent-sounding notes that takes about 30 seconds to explain. And then I have them play that way for a long time, only occasionally adding complexity - a new voicing, a different chord order.

Likewise, to train a neural network, you let it make blind guesses, then tell it how it did, until it gets very good at finding the right answer. There seems to be a lot of value in messing around as a form of training.

I get the sense that in your model, blind calculation is a *distraction *from intuitive problem solving. In my model, blind calculation *builds *intuition for the problem. In both our models, intuitive problem-solving must be followed by proof, which is the step that Pearson skipped.

## ↑ comment by johnswentworth · 2021-03-16T01:36:12.643Z · LW(p) · GW(p)

I agree with this to some extent. Playing around with the symbols is useful for getting a general intuition for what-sorts-of-results-are-easy. It's a good way to find things which are "nearby" in the expression-manipulation graph, and to notice patterns in the general structure of that graph. Where it usually doesn't suffice is harder problems, where you have to go pretty "far away" in the expression-graph or find the right possibility in a very large space. That's where the exponentially large number of possibilities really kick in, and we need more powerful tools.

So I agree that playing around is often a useful way to build intuition for some aspects of the problem, and sometimes even a necessary step, but it usually isn't sufficient for harder problems.

Replies from: AllAmericanBreakfast## ↑ comment by AllAmericanBreakfast · 2021-03-16T02:48:40.158Z · LW(p) · GW(p)

We talked about this issue in the comments on this post of yours [LW · GW] 9 months ago :)

I get the sense that you're considering problems where an open-ended search doesn't tell you if you're heading in the right direction.

So for example, if we play Marco Polo, when I shout "Marco" and everybody else shouts "Polo," this gives me some information about where they are. They might move, but they can't necessarily move fast enough to avoid me.

If we changed the game so people only have to *whisper* "Polo," I rarely, if ever, will gain information from shouting "Marco," and will mostly be stumbling around in the dark.

I might need some pretty sophisticated thinking to tag somebody. Perhaps I'd consider the shape of the swimming pool, who's likely to try sneaking up behind me just for fun, who's likely to get bored and forget that I'm still out to tag them, my energy level, and perhaps shift to quietly swimming around while listening for the splashes of the other players.

And these are not considerations that are immediately suggested by the "rules of the game," which encourage you to think in terms of the actions: swimming, shouting "Marco," and tagging. A kid who operated by just experimenting with when they should shout "Marco" and swimming around randomly is extremely unlikely to arrive at the precise manner of playing the game that might let them tag somebody in *whisper Marco Polo.*

A problem statement can prime us to think in terms of misleading similarities (i.e. using a correlation coefficient as a proxy for information), or non-strategic movement (i.e. making a lot of noise splashing around and shouting "Marco" in a way that makes it hard to hear where others are).

Shifting our focus from babbling *within* constraints to babbling *about* constraints seems to be a useful move.

Babbling *within* constraints examples:

- Pushing chess pieces around according to the rules of the game
- Tracing random routes through a maze
- Playing arbitrary notes over a I-V-vi-IV chord progression
- Swimming around and shouting "Marco" as fast as you can
- Shifting materials in such a way as to form a platform crossing the river

Babbling *about *constraints examples:

- Imagining principles that might be helpful ("move my pieces closer to the other king," "put pieces where they have the largest number of possible moves," "check to make sure they can't take your piece for free on the next move").
- Look for areas of the maze that you definitely
*don't*need to move through. - Periodically changing the register, scrambling the order of the chords every 8 measures
- Trying a tactic of standing still, listening for swimmers, and then surprise-lunging; not shouting "Marco" until you hear the others close by
- Observing the height of the river over time, in order to determine how high and wide it gets when it floods.

I notice that I tend to alternate between these two modes. It's often quite useless for me to force myself to come up with constraints for a problem I haven't tried messing around with yet. Likewise, at a certain point, messing around becomes obviously fruitless, and imposing constraints becomes more productive.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-16T03:05:56.362Z · LW(p) · GW(p)

That's an awesome analogy/example. Well done.

Also, this comment would only need a little more flesh on it to be a great post in its own right.

Replies from: AllAmericanBreakfast## ↑ comment by AllAmericanBreakfast · 2021-03-16T03:35:52.154Z · LW(p) · GW(p)

Thanks, John :) Your post has been thought-provoking for me (three top-level comments so far), so that might have to happen!

## comment by AllAmericanBreakfast · 2021-03-15T23:17:49.855Z · LW(p) · GW(p)

Here's an example to show where Pearson's approach goes wrong:

A Q comes in two types: an X or a Y.

- An X comes in two types: A1 or B2.
- A Y comes in two types: A2 or B1.

You need to **reveal** the A/B- and 1/2-data for your Qs, while keeping a **secret **whether the Q is an X or a Y.

The A vs. B and 1 vs. 2 properties of a Q are uncorrelated with whether it's an X or a Y. You can reveal one or the other, but not both, while keeping the secret. If you reveal both, even though neither piece of information is correlated with the secret identity, you reveal the secret.

Replies from: svyatoslav-usachev-1## ↑ comment by Svyatoslav Usachev (svyatoslav-usachev-1) · 2021-03-17T22:28:41.596Z · LW(p) · GW(p)

No, that's not what's wrong with Pearson's approach. Your example suffers from a different issue.

Replies from: AllAmericanBreakfast## ↑ comment by AllAmericanBreakfast · 2021-03-17T23:03:35.460Z · LW(p) · GW(p)

Can you give an example to explain? It's the best example I could give based on the description in the OP.

Replies from: svyatoslav-usachev-1## ↑ comment by Svyatoslav Usachev (svyatoslav-usachev-1) · 2021-03-25T22:54:12.712Z · LW(p) · GW(p)

What you are describing is data (A/B, 1/2) such that *parts* of the data are *independent* from the secret X/Y, but the *whole* data is *not independent* from the secret. That's an issue that is sort of unusual for any statistical approach, because it should be clear that only the *whole* leaked data should be considered.

The problem with Pearson correlation criterion is that it does not measure *independence* at all (even for parts of the data), but measures *correlation* which is just a single statistic of the two variables. It's as if you compared two distributions by comparing their means.

Let's say *leaked *data is X = -2, -1, 1, 2 equiprobably, and *secret* data is Y = X^2. Zero correlation *just* implies E(XY) - E(X)E(Y) = 0, which is the case, but it is clear that one can fully restore the secret from the leaked, they are not independent at all.

See more at https://en.wikipedia.org/wiki/Correlation_and_dependence#Correlation_and_independence

## comment by ryan_b · 2021-03-17T15:40:59.388Z · LW(p) · GW(p)

This post reminds me of non-standard analysis.

The story goes like this: in the beginning, Leibniz and Newton developed calculus using infinitesimals, which were intuitive but had no rigorous foundation (which is to say, ad-hoc). Then, the calculus was developed, which meant using limits instead, and they had a rigorous foundation. Then, despite much wailing and gnashing of teeth from students ever after, infinitesimals were abandoned for centuries. Suddenly came the 1960s, when Abraham Robinson provided a rigorous formalism for infinitesimals, which gave birth to non-standard (using infinitesimals) analysis to be contrasted with standard (using ) analysis.

So now there is continuous low-grade background fight going on where people do work in non-standard analysis but then have to convert it into standard analysis to get published, and the non-standards say theirs is intuitive and the standards say theirs is formally just as powerful and everyone knows it already so doing it any other way is stupid.

The way this relates to the post is claims like the following, about why non-standard analysis can generate proofs that standard analysis (probably) can't:

The reason for this confusion is that the set of principles which are accepted by current mathematics, namely ZFC, is much stronger than the set of principles which are actually used in mathematical practice. It has been observed (see [F] and [S]) that almost all results in classical mathematics use methods available in second order arithmetic with appropriate comprehension and choice axiom schemes. This suggests that mathematical practice usually takes place in a conservative extension of some system of second order arithmetic, and that it is difficult to use the higher levels of sets. In this paper we shall consider systems of nonstandard analysis consisting of second order nonstandard arithmetic with saturation principles (which are frequently used in practice in nonstandard arguments). We shall prove that nonstandard analysis (i.e. second order nonstandard arithmetic) with the -saturation axiom scheme has the same strength as third order arithmetic. This shows that in principle there are theorems which can be proved with nonstandard analysis but cannot be proved by the usual standard methods. The problem of finding a specific and mathematically natural example of such a theorem remains open. However, there are several results, particularly in probability theory, whose only known proofs are nonstandard arguments which depend on saturation principles; see, for example, the monograph [Ke].

Experience suggests that it is easier to work with nonstandard objects at a lower level than with sets at a higher level.This underlies the success of nonstandard methods in discovering new results. To sum up,nonstandard analysis still takes place within ZFC, but in practice it uses a larger portion of full ZFC than is used in standard mathematical proofs.

Emphasis mine. This is from On the Strength of Nonstandard Analysis, a 1986 paper by C Ward Henson and H Jerome Keisler. I found the paper and this part of the quote from a StackOverflow answer. Note: you will probably have to use sci-hub unless you have Cambridge access; the find-free-papers browser tools seem to mismatch this with another later paper by Keisler with an almost identical title.

I now treat this as a pretty solid heuristic: when it comes to methods or models, when people *say* it is intuitive, they *mean* that it chunks at least some stuff at a lower level of abstraction. Another math case with similar flavor of claim is Hestenes' Geometric Algebra, which mostly does it by putting the geometric structure at the foundation, which allows humans to use their pretty-good geometric intuition throughout. This pays out by tackling some questions previously reserved for QM with classical methods, among other neat tricks.

For the record I do not know how to do non-standard analysis even a little; I only ever knew what it was because it gets a footnote in electrical engineering as "that thing that let us figure out how to convert between continuous and discrete time."

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-17T16:20:38.903Z · LW(p) · GW(p)

This is a great comment.

I disagree with this particular line, though I don't think it messes up your general point here (if anything it strengthens it):

The story goes like this: in the beginning, Leibniz and Newton developed calculus using infinitesimals, which were intuitive but had no rigorous foundation (which is to say, ad-hoc).

Part of the point of the post is that ad-hoc-ness is not actually about the presence or absence of rigorous mathematical foundations; it's about how well the mathematical formulas we're using match our intuitive concepts. It's the correspondence to intuitive concepts which tells us how much we should expect the math to generalize to new cases which our intuition says the concept should generalize to. The "arguments" which we want to uniquely specify our formulas are not derivations or proofs from ZFC, they're intuitive justifications for why we're choosing these particular *definitions*.

So I'd actually say that infinitesimals were *less* ad-hoc, at least at first, than epsilon-delta calculus.

This also highlights an interesting point: ad-hoc-ness and rigorous proofs are orthogonal. It's possible to have the right formulas for our intuitive concepts *before* we know exactly what rules and proofs will make it fully rigorous.

## ↑ comment by ryan_b · 2021-03-17T18:15:21.598Z · LW(p) · GW(p)

Highlighting the difference between ad-hoc-ness and rigor was what I was trying to do when I emphasized that element, though I shoulda put the parentheses between the intuition and rigor section. The implicit assumption I made, which I should probably make explicit, is that if we have something which matches our intuitive concepts well *and* has a rigorous foundation then I expect it to dominate other options (both in terms of effectiveness and popularity).

Fleshing out the assumption a bit: if you made a 2x2 graph with ad-hoc as the x axis and rigor as the y axis, the upper right quadrant is the good stuff we use all the time; the upper left quadrant is true-but-useless, the bottom left quadrant is ignored completely, and the bottom right quadrant of high ad-hoc but low rigor is where all the action is (in the sense of definitions that might be really useful and adopted in the future).

The infinitesimal vs limits case seems like an example: good intuition match and poor rigor was replaced with acceptable intuition match and good rigor. However, it is a bit messy - I'm zeroing in on the infinitesimals vs limits as methods rather than definitions per se, or something like the presentation of the fundamental theorem of calculus.

I quite separately took the liberty of assuming the same logic you are applying to definitions could be applied to the rest of mathematical architecture, like methods, algorithms, notation, and so on. I admit this introduces quite a bit of fuzz.

## comment by ryan_b · 2021-03-17T13:30:39.020Z · LW(p) · GW(p)

Meta: I like the use of the Recap and In Short sections of the post, which in my opinion did an excellent job of summarizing and then testing whether I followed the post correctly. By this I mean that if the In Short section wasn't relatively clear, than I missed something important about the post.

## comment by AllAmericanBreakfast · 2021-03-16T03:35:02.660Z · LW(p) · GW(p)

Intuition can give us a rough sense of how much a constraint idea limits the size of the search space, and how likely it is to be valid.

The right way to go is to brainstorm a large number of possible constraint ideas, and use intuition to prioritize them for proof and eventual rollout of a solution.

The wrong way to go is to jump straight from idea to proof, or even worse straight from idea to execution, as happened in Pearson's case. What leads us down this false path? Mathematical symbols, famous names, being a tall person with a deep voice, you know the drill.

## comment by TurnTrout · 2021-03-15T22:55:07.700Z · LW(p) · GW(p)

The key challenge here is to come up with a set of intuitive arguments which

uniquelyspecify a particular definition/metric, exactly like a set of equations can uniquely specify a solution.

For another perspective on this challenge: given the math, can you infer the intuitive concept it's formalizing [LW · GW]?

## comment by Simuon · 2021-03-17T20:30:21.815Z · LW(p) · GW(p)

Is there a reference on the events with the Bell labs ? I can imagine some scenarii where the military transmits some information an can sort of engineer what the adverse party can read (for example Eve can read the power supply of some device, Alice must then add sufficient noise on the power supply), but none seems realistic in the context.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-17T20:49:38.737Z · LW(p) · GW(p)

The story is fictional, not historical.

Replies from: Simuon## ↑ comment by Simuon · 2021-03-18T12:31:52.546Z · LW(p) · GW(p)

I was suspecting it was more of a fable, but I hoped it was historical (there are many true cryptographic stories in this style, though I don't now any about this "proxy" problem). I think it is a bit dangerous to draw conclusions from a fictional story, though the fable made this post more engaging, and I think I mostly agree with its conclusion.

Why is using a fable to construct an argument dangerous ? Suppose Aesope wrote a fable on some goose laying golden eggs, and people draw the conclusion that you should not experiment around positive phenomena in fear to lose what you got. Later, Aristotle understood that science is actually good. He advised Alexander to be curious, then Alexander cut the Gordian knot and became "the Great".

Well, this was meta and self-defeating. But here is a more interesting story: economists usually tell a fable about the birth of currency, emerging from barter to solve the problem of the double coincidence of wants (Jevons, 1875). This is a great thought experiment, but it is too often seen as a realistic description of how money was invented, while there is anthropological and historical evidence that money was at first issued by the state, and considered as a debt token rather than having value in itself (Graeber). Framing the thought experiment as historical results in a public discourse where hyperinflation is waved as the inevitable consequence of state-issued money. The conclusion I draw from this story is that thought experiment should not be framed as historical stories, because it prevents us from seeing other aspects of the problem.

Does this apply to the post ? I'm not sure... the fable is not really framed as historical; what the rest of the argument needs from the story is mostly that Pearson's correlation is misleading while Shannon's mutual information is on point. Maybe we can open some interesting perspectives by looking at historical examples where the correlation is mistakenly used in place of mutual information. The point that uniqueness is an interesting proxy for robustness stands still; I think it could be developed into a more general discussion around the advantages of uniqueness in a metric.

## comment by newcom · 2021-03-17T12:42:43.524Z · LW(p) · GW(p)

The key challenge here is to come up with a set of intuitive arguments which

uniquelyspecify a particular definition/metric, exactly like a set of equations can uniquely specify a solution. If our arguments have “many solutions”, then there’s little reason to expect that the ad-hoc “solution” we chose actually corresponds to our intuitive concept.

Maybe I'm missing something in the post, but why is this the case? Isn't it arbitrary to suppose that only one possible metric exists that fully 'solves' the problem?

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-17T16:35:12.937Z · LW(p) · GW(p)

Good question, I was hoping someone would ask this. There's some subtleties here that I didn't want to unpack in the post.

Sometimes, I use a formula to specify things other than points. Like, I could use a formula to specify a line (e.g. y = 3x+2) or a sphere (x^2 + y^2 + z^2 = 1). These equations have "more than one solution" in the sense that there are many points which satisfy them. However, I'm not actually trying to specify one particular point; I'm trying to specify the whole set of points which satisfies the equation (i.e. the line or the sphere). And the equations *do* fully specify those sets of points.

In general, any set of equations fully specifies *some* set of solutions (possibly the empty set).

The interesting question is whether the set-of-solutions-specified actually matches our intuitive concept. If not, then we have no reason to expect that the set-of-solutions will generalize in the ways we expect our intuitive concept to generalize.

Now let's go back to the idea of ad-hoc-ness. Suppose I give some intuitive argument that my concept should satisfy the formula x^2 + y^2 + z^2 = 1. But I also think that the concept-I-want-to-specify is a circle, not a sphere; so this formula alone is not sufficient to nail it down. If I were to arbitrarily choose the circle given by the equations (x^2 + y^2 + z^2 = 1, z = 4x - y), then that would be an ad-hoc specification; I have no reason to expect that particular circle to match my intuitive concept.

Then there's the question of why I should expect my intuitions to nail down *one particular* circle. That's something which would have to have an intuitive argument in its own right. But even if it's not picking one particular circle, there is still *some* set of answers which match my intuition (e.g. a set of circles). If we want our formula to generalize in the cases where we intuitively expect generalization (and fail to generalize in the cases where we intuitively expect failure of generalization), then we do need to match that set.

This argument generalizes, too. Maybe someone says "well, my intuitions are fuzzy, I don't expect a sharp boundary between things-which-satisfy-them and things-which-don't". And then we say ok, we have mathematical ways of handling fuzziness (like probabilities, for instance), we should find a formulation for which the mathematical fuzz matches the intuitive fuzz, so that it will fuzzily-generalize when we expect it to do so. Etc.

## comment by Adele Lopez (adele-lopez-1) · 2021-03-16T19:26:49.856Z · LW(p) · GW(p)

Something that bothers me about the Shannon entropy is that we know that it's not the most fundamental type of entropy there is, since the von Neumann entropy is more fundamental.

A question I don't have a great answer for: How could Shannon have noticed (a priori) that it was even possible that there was a more fundamental notion of entropy?

Replies from: johnswentworth, Simuon## ↑ comment by johnswentworth · 2021-03-16T20:00:17.431Z · LW(p) · GW(p)

I don't think I'd call von Neumann entropy "more fundamental". After all, it only applies to quantum-mechanical universes, whereas Shannon applies to a much wider variety of universes. And to the extent that von Neumann entropy is itself interpretable as a Shannon entropy (which is how I usually think of it), Shannon also applies to this universe.

Replies from: adele-lopez-1## ↑ comment by Adele Lopez (adele-lopez-1) · 2021-03-16T21:04:52.690Z · LW(p) · GW(p)

Shannon entropy is straightforwardly a special case of von Neumann entropy, so it applies to at least as many kinds of universes.

I still feel a bit confused about the "fundamentalness", but in trying to formulate a response, I was convinced by Jaynes that von Neumann entropy has an adequate interpretation in terms of Shannon entropy.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-16T23:29:57.942Z · LW(p) · GW(p)

Shannon entropy is straightforwardly a special case of von Neumann entropy, so it applies to at least as many kinds of universes.

Only if we're already representing the universe in terms of quantum basis states. We can always take e.g. a deterministic universe and represent it in terms of pure states with zero entanglement, but that still involves a typecasting operation on our universe-state.

That's the real issue here: von Neumann entropy comes with an assumption about the type signature of our universe-state, while Shannon entropy doesn't - it's just using plain old probability, and we can stick probability distributions on top of whatever we please.

Replies from: adele-lopez-1## ↑ comment by Adele Lopez (adele-lopez-1) · 2021-03-16T23:49:53.038Z · LW(p) · GW(p)

This doesn't make sense to me. It seems that if you're being strict about types, then "plain old probabilities" also require the correct type signature, and by using Shannon entropy you are still making an implicit assumption about the type signature.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-16T23:57:13.658Z · LW(p) · GW(p)

What's the type signature of the things over which we have a probability distribution?

Replies from: adele-lopez-1## ↑ comment by Adele Lopez (adele-lopez-1) · 2021-03-17T00:05:02.337Z · LW(p) · GW(p)

Things that you can cast as a finite set. You can stretch this a bit by using limits to cover things that can be cast as compact metric spaces (and probably somewhat more than this), but this requires care and grounding in the finite set case in order to be unambiguously meaningful.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-17T01:17:21.610Z · LW(p) · GW(p)

Ok, I see what you're picturing now. That's the picture we get if we approach probability through the Kolmogorov axioms. We get a different picture if we approach it through Cox' theorem or logical inductors: these assign probabilities to sentences in a logic. That makes the things-over-which-we-have-a-probability-distribution extremely general - basically, we can assign probabilities to any statements we care to make about the universe, regardless of the type signature of the universe state.

Replies from: adele-lopez-1, gworley## ↑ comment by Adele Lopez (adele-lopez-1) · 2021-03-17T03:05:18.756Z · LW(p) · GW(p)

Ah, that makes sense, thanks! I'd still say "sentences in a logic" is a specific type though.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-17T03:40:02.890Z · LW(p) · GW(p)

Definitely, yes. The benefit is that it avoids directly specifying the type of the world-state.

## ↑ comment by G Gordon Worley III (gworley) · 2021-03-17T02:24:20.623Z · LW(p) · GW(p)

Isn't this just begging the question, though, by picking up an implicit type signature via the method by which probabilities are assigned? Like, if we lived in a different universe that followed different physics and had different math I'm not convinced it would all work out the same.

Replies from: johnswentworth## ↑ comment by johnswentworth · 2021-03-17T16:22:23.193Z · LW(p) · GW(p)

If the physics were different, information theory would definitely still be the same - it's math, not physics. As for "different math", I'm not even sure what that would mean or if the concept is coherent at all.

## ↑ comment by Simuon · 2021-03-18T12:47:48.856Z · LW(p) · GW(p)

I think the merit of Shannon was not to define entropy, but to understand the operational meaning of entropy in terms of coding a message with a minimal number of letters, leading to the notion of the capacity of a channel of communication, of error-correcting code and of "bit".

Von Neumann's entropy was introduced before Shannon's entropy (1927, although the only reference I know is von Neumann's book from 1932). It was also von Neumann how suggested the name "entropy" for the quantity that Shannon found. What Shannon could've noticed was that von Neumann's entropy also has an operational meaning. But for that, he would've had to be interested in the transmission of quantum information by quantum channels, ideas that were not around at the time.