# Gelman Against Parsimony

post by lukeprog · 2013-11-24T15:23:32.773Z · LW · GW · Legacy · 24 comments

In two posts, Bayesian stats guru Andrew Gelman argues against parsimony, though it seems to be favored 'round these parts, in particular Solomonoff Induction and BIC as imperfect formalizations of Occam's Razor.

Gelman says:

I’ve never seen any good general justification for parsimony...

Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.

In practice, I often use simple models–because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!

My favorite quote on this comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104: "Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well."...

...ideas like minimum-description-length, parsimony, and Akaike’s information criterion, are particularly relevant when models are estimated using least squares, maximum likelihood, or some other similar optimization method.

When using hierarchical models, we can avoid overfitting and get good descriptions without using parsimony–the idea is that the many parameters of the model are themselves modeled. See here for some discussion of Radford Neal’s ideas in favor of complex models, and see here for an example from my own applied research.

## 24 comments

Comments sorted by top scores.

## comment by drethelin · 2013-11-24T17:57:41.476Z · LW(p) · GW(p)

eh, this just seems like a repeat of arguments against greedy reductionism. Parsimony is good except when it loses information, but if you're losing information you're not being parsimonious correctly.

Replies from: timtyler

## ↑ comment by timtyler · 2013-11-25T11:30:56.112Z · LW(p) · GW(p)

Parsimony is good except when it loses information, but if you're losing information you're not being parsimonious correctly.

So: Hamilton's rule is not being parsimonious "correctly"?

Replies from: drethelin

## ↑ comment by drethelin · 2013-11-25T19:33:41.600Z · LW(p) · GW(p)

Probably not. I'm not exactly sure what you mean by this question, since I don't fully understand Hamilton's rule, but in general evolutionary stuff only needs to be close enough to correct rather than actually correct.

Replies from: timtyler

## comment by Cyan · 2013-11-25T22:07:19.463Z · LW(p) · GW(p)

Gelman wants to throw everything he can into his models -- and then use multilevel (a.k.a. hierarchical) models to share information between exchangeable (or conditionally exchangeable) batches of parameters. The key concept: multilevel model structure makes the "effective number of parameters" become a quantity that is itself inferred from the data. So he can afford to take his "against parsimony" stance (which is really a stance against leaving *potentially* useful predictors out of his models) because his default model choice will induce parsimony just when the data warrant it.
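A minimal Python sketch of that mechanism, under assumptions that are mine rather than Gelman's: a normal hierarchical model, a known and common sampling variance for each group mean, and a crude method-of-moments estimate of the between-group variance instead of a full Bayesian fit. The function name and the data are hypothetical. It only illustrates how the shrinkage weight, and with it an "effective number of parameters", falls out of the data.

```python
import statistics


def partial_pooling(group_means, sampling_var):
    """Shrink observed group means toward their grand mean.

    group_means  -- observed mean of each group (hypothetical data)
    sampling_var -- assumed known, positive sampling variance of each mean
    Returns (shrunken estimates, shrinkage weight, effective parameters).
    """
    grand_mean = statistics.fmean(group_means)
    # Method-of-moments estimate of the between-group variance tau^2,
    # truncated at zero when the groups look no more varied than noise.
    tau2 = max(0.0, statistics.variance(group_means) - sampling_var)
    # Weight placed on the grand mean: 1.0 is complete pooling,
    # values near 0.0 leave each group almost on its own.
    shrink = sampling_var / (sampling_var + tau2)
    estimates = [shrink * grand_mean + (1.0 - shrink) * m for m in group_means]
    # Trace of the linear map from observed means to estimates, holding
    # the shrinkage weight fixed: one parameter for the grand mean plus
    # the unpooled fraction of the remaining group effects.
    n_groups = len(group_means)
    effective_params = 1.0 + (n_groups - 1) * (1.0 - shrink)
    return estimates, shrink, effective_params


similar = [10.0, 10.5, 9.5, 10.2, 9.8]   # groups that look alike
spread = [0.0, 20.0, 40.0, 60.0, 80.0]   # groups that clearly differ
for name, means in (("similar", similar), ("spread", spread)):
    _, shrink, eff = partial_pooling(means, sampling_var=4.0)
    print(f"{name}: shrink={shrink:.3f}, effective parameters={eff:.2f}")
```

With the same five nominal group parameters in both cases, the similar groups collapse to roughly one effective parameter while the spread-out groups keep nearly all five, which is the sense in which the model "induces parsimony just when the data warrant it".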

## ↑ comment by gwern · 2013-11-26T01:31:56.408Z · LW(p) · GW(p)

I think one of Gelman's comments in the first link is helpful:

In principle, models (at least for social-science phenomena) should be ever-expanding flowers that have within them the capacity to handle small data sets (in which case, inferences will be pulled toward prior knowledge) or large data sets (in which case, the model will automatically unfold to allow the data to reveal more about the phenomenon under study). A single model will have zillions of parameters, most of which will barely be "activated" if sample size is not large.

In practice, those of us who rely on regression-type models and estimation procedures can easily lose control of large models when fit to small datasets. So, in practice, we start with simple models that we understand, and then we complexify them as needed. This has sometimes been formalized as a "sieve" of models and is also related to Cantor's "diagonal" argument from set theory. (In this context, I'm saying that for any finite class of models, there will be a dataset for which these models don't fit, thus requiring model expansion.)

## comment by Daniel_Burfoot · 2013-11-24T18:26:45.099Z · LW(p) · GW(p)

I’ve never seen any good general justification for parsimony...

This is a strange statement for a Bayesian to make. Perhaps he means that there is no reason to require *absolute* parsimony, which is true; sometimes if you have enough data you can justify the use of complex models. But Bayesian methods certainly require *relative* parsimony, in the sense that the model complexity needs to be small compared to the quantity of information being modeled. Formally, let A be the entropy of the prior distribution, and B be the mutual information between the observed data and the model parameter(s). Then unless A is small compared to B (relative parsimony), Bayesian updates won't substantially shift belief away from the prior, and the posterior will be just a minor modification of the prior, so the whole process of obtaining data and performing inference will have produced no actual change in belief.

The difference between the MDL philosophy and the Bayesian philosophy is actually quite minor. There are some esoteric technical arguments about things like whether one method or the other converges in the limit of infinite data, but at the end of the day the two philosophies say almost exactly the same thing.

Replies from: timtyler

## ↑ comment by timtyler · 2013-11-26T11:03:39.927Z · LW(p) · GW(p)

Bayesian methods certainly require relative parsimony, in the sense that the model complexity needs to be small compared to the quantity of information being modeled.

Not really. Bayesian methods can model random noise. Then the model is of the same size as the data being modeled.

## comment by Strilanc · 2013-11-24T18:02:01.617Z · LW(p) · GW(p)

In practice, I often use simple models–because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!

Parsimony is a prior, not an end goal. At least, that's how it's used in Solomonoff induction.

The reason the Solomonoff prior doesn't apply to social sciences is because knowing the area of applicability gives you more information. Once you take that into account, as well as the fact that you don't have the input data or computational power to recompute the cumulative process that spit humans out so the simple low level theories are out of reach, your prior is skewed towards more complex models.

Replies from: timtyler, V_V

## ↑ comment by timtyler · 2013-11-26T10:46:25.285Z · LW(p) · GW(p)

The reason the Solomonoff prior doesn't apply to social sciences is because knowing the area of applicability gives you more information.

That doesn't mean it doesn't apply! "Knowing the area of applicability" is just some information you can update on after starting with a prior.

## comment by Dr_Manhattan · 2013-11-27T16:41:41.434Z · LW(p) · GW(p)

While you can argue whether simpler models are inherently better - basically arguing about the "texture" of the universe we live in - simple models definitely generalize better, so if you act based on a simpler model you have better confidence that things will work "as expected". The flip side of this is that to have confidence in complex models you need a lot more data, which is expensive in all kinds of ways.

You could claim that human attraction to simple models is due to their low cost/better generalization rather than because the "texture of the world" is simple, though unification in physics seems to indicate the latter.

## comment by timtyler · 2013-11-26T10:51:21.482Z · LW(p) · GW(p)

I often use simple models–because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!

Recommended reading: Boyd and Richerson's Simple Models of Complex Phenomena.

## comment by Lumifer · 2013-11-25T04:45:22.108Z · LW(p) · GW(p)

"Everything should be made as simple as possible, but not simpler." -- Albert Einstein.

But yes, Occam's Razor is not a natural law or anything like that. It's a **heuristic** -- something that usually points in the right direction but is very much not guaranteed to be correct.

## ↑ comment by gjm · 2013-11-26T17:05:58.482Z · LW(p) · GW(p)

It's arguably a bit more than that, on account of Solomonoff induction. An "Occamian" prior that weights computable hypotheses according to the fraction of computer-program-space occupied by programs that compute their consequences provably performs -- in an admittedly somewhat artificial sense -- at least as well in the long run as *any* other prior, provided the observations you see really are generated by something computable.

More practically, there has to be a complexity penalty in the following sense: no matter what probabilities you assign, almost all very complex hypotheses have to be very improbable because otherwise your total probability has to be infinite.

Replies from: Eugine_Nier, Lumifer

## ↑ comment by Eugine_Nier · 2013-12-02T00:29:02.152Z · LW(p) · GW(p)

An "Occamian" prior that weights computable hypotheses according to the fraction of computer-program-space occupied by programs that compute their consequences provably performs -- in an admittedly somewhat artificial sense -- at least as well in the long run as any other prior, provided the observations you see really are generated by something computable.

Yes and any prior that doesn't assign things zero probability has this property. Why that one in particular?

Replies from: gjm

## ↑ comment by gjm · 2013-12-02T02:23:46.129Z · LW(p) · GW(p)

any prior that doesn't assign things zero probability has this property

Oh yes, so it does. Let me therefore be both more precise and more accurate.

Let p be an Occamian prior in this sense and q any computable prior. Then as cousin_it remarks "a computable human cannot beat Solomonoff in accumulated log scores by more than a constant, even if the universe is uncomputable and loves the human"; in other words, whatever q is -- however much information about the world is built into it in advance -- it can't do much better than p, even though p encodes *no* information about the world (it can't since what the theorem says is that even if you choose what the world does pessimally-for-p, it still does pretty well). This is not true for arbitrary priors.
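The "cannot beat it by more than a constant" claim has a small computable analogue that can be checked directly. The Python sketch below is not Solomonoff induction (which mixes over all computable hypotheses and is uncomputable); it is a Bayes mixture over five fixed coin-bias "experts" with a toy prior that halves with each position in the list. The experts, the prior, and the simulated data are all hypothetical choices of mine. For any such mixture, the accumulated log loss exceeds that of expert i by at most ln(1/prior_i), whatever the data are.

```python
import math
import random

# Hypothetical experts: each always predicts P(next bit = 1) = its bias.
EXPERTS = [0.5, 0.7, 0.3, 0.9, 0.1]
# Toy "Occam-like" prior: earlier (notionally simpler) experts get more
# weight. The weights sum to 1.
PRIOR = [1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 16]


def log_loss(prob_one, bit):
    """Log loss in nats of predicting P(bit = 1) = prob_one."""
    return -math.log(prob_one if bit == 1 else 1.0 - prob_one)


def run_mixture(experts, prior, bits):
    """Return (mixture loss, per-expert losses) accumulated over bits."""
    weights = list(prior)  # current posterior over experts, sums to 1
    mixture_loss = 0.0
    expert_losses = [0.0] * len(experts)
    for bit in bits:
        # The mixture predicts with the posterior-weighted average.
        mix_p = sum(w * p for w, p in zip(weights, experts))
        mixture_loss += log_loss(mix_p, bit)
        for i, p in enumerate(experts):
            expert_losses[i] += log_loss(p, bit)
            # Bayes update: multiply by the likelihood of the observed bit.
            weights[i] *= p if bit == 1 else 1.0 - p
        total = sum(weights)
        weights = [w / total for w in weights]
    return mixture_loss, expert_losses


rng = random.Random(0)
bits = [1 if rng.random() < 0.7 else 0 for _ in range(500)]
mixture_loss, expert_losses = run_mixture(EXPERTS, PRIOR, bits)
for bias, w, loss in zip(EXPERTS, PRIOR, expert_losses):
    bound = loss + math.log(1.0 / w)
    print(f"expert {bias}: loss={loss:.1f}, mixture={mixture_loss:.1f}, "
          f"bound={bound:.1f}")
```

The bound holds for any prior with nonzero weights, which is Eugine_Nier's point; what differs between priors is the size of the constant ln(1/prior_i) paid for each competitor.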

## ↑ comment by Eugine_Nier · 2013-12-08T21:52:49.249Z · LW(p) · GW(p)

a computable human cannot beat Solomonoff in accumulated log scores by more than a constant, even if the universe is uncomputable and loves the human

Well, since Solomonoff is uncomputable, this isn't really a fair comparison.

Replies from: gjm

## ↑ comment by gjm · 2013-12-09T02:01:49.308Z · LW(p) · GW(p)

I wasn't arguing that we should all be actually doing Solomonoff induction. (Clearly we can't.) I was saying that there is a somewhat-usable sense in which preferring simpler hypotheses seems to be The Right Thing, or at least A Right Thing. Namely, that basing your probabilities miraculously accurately on simplicity leads to good results. The same isn't true if you put something other than "simplicity" in that statement.

I wonder whether there are any theorems along similar lines that don't involve any uncomputable priors. (Something handwavily along the following lines: If p,q are two computable priors and p is dramatically enough "closer to Occamian" than q, then an agent with p as prior will "usually" do better than an agent with q as prior. But I have so far not thought of any statement of this kind that's both credible and interesting.)

## ↑ comment by Lumifer · 2013-11-26T18:39:36.108Z · LW(p) · GW(p)

...on account of Solomonoff induction

My impression is that Solomonoff induction starts by **assuming** the Occam's Razor.

no matter what probabilities you assign, almost all very complex hypotheses have to be very improbable because otherwise your total probability has to be infinite.

That's not a problem -- all simple hypotheses can be just as improbable.

Again, I am not saying that Occam's Razor is not a useful heuristic. It is. But it is not **evidence**.

## ↑ comment by TheOtherDave · 2013-11-26T20:09:18.348Z · LW(p) · GW(p)

I am not saying that Occam's Razor is not a useful heuristic. It is. But it is not evidence.

Can you restate what you consider the use of Occam's Razor to be, and what you consider evidence to be for?

Because from my perspective the purpose of evidence is to increase/decrease my confidence in various statements, and it seems to me that Occam's Razor is useful for doing precisely that. So this distinction doesn't make a lot of sense to me, and rereading the thread doesn't clarify matters.

## ↑ comment by gjm · 2013-11-26T20:22:02.475Z · LW(p) · GW(p)

My impression is that Solomonoff induction starts by **assuming** the Occam's Razor.

The fact that it buys you something interesting *without* making that assumption was the whole point of the paragraph you were commenting on.

That's not a problem -- all simple hypotheses can be just as improbable.

I don't believe that is true. Perhaps I've been insufficiently clear by trying to be brief (the difficulty being that "very complex" is really shorthand for something involving a limiting process), so let me be less brief.

First: Suppose you have a list of mutually exclusive hypotheses H1, H2, etc., with probabilities p1, p2, etc. List them in increasing order of complexity. Then the sum of all the pj is finite, and therefore as j -> infinity pj -> zero. Hence, "very complex hypotheses (in this list) have to be very improbable" in the following sense: for any probability p, however small, there's a level of complexity C such that every hypothesis from your list whose complexity is at least C has probability smaller than p.
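To make the counting step concrete: because the probabilities sum to at most 1, at most 1/p hypotheses can have probability p or more, so there is a last one in the list and C can be placed just after it. Here is a small Python sketch of that, using a finite truncation of the list and a hypothetical prior of my own choosing (proportional to 1/j^2, with one simple and one complex hypothesis swapped so that the prior is not monotone in complexity).

```python
def complexity_cutoff(probs, threshold):
    """Smallest index C such that every hypothesis at index >= C has
    probability below threshold. probs is ordered by increasing
    complexity; the index stands in for the complexity level."""
    cutoff = 0
    for j, p in enumerate(probs):
        if p >= threshold:
            cutoff = j + 1
    return cutoff


# Hypothetical prior over 1000 mutually exclusive hypotheses.
raw = [1.0 / (j + 1) ** 2 for j in range(1000)]
# Swap two entries: a fairly complex hypothesis (index 200) is made
# probable and a simple one (index 3) improbable, so probability does
# not simply fall with complexity.
raw[3], raw[200] = raw[200], raw[3]
total = sum(raw)
probs = [r / total for r in raw]

threshold = 0.01
cutoff = complexity_cutoff(probs, threshold)
heavy = sum(1 for p in probs if p >= threshold)
print(f"hypotheses with probability >= {threshold}: {heavy}")
print(f"all hypotheses from index {cutoff} on are below {threshold}")
```

Nothing forces the probable hypotheses to be the simplest ones; the constraint is only that there are few of them, so beyond some complexity level they run out.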

That doesn't *quite* mean that very complex hypotheses have to be improbable. Indeed, you can construct very complex high-probability hypotheses as very long disjunctions. And since p and ~p have about the same complexity for any p, it must in some sense be true that about as many very complex propositions have high probabilities as have low probabilities. (So what I said certainly wasn't quite right.)

However, I bet something along the following lines is true. Suppose you have a probability distribution over propositions (this is for *generating* them, and isn't meant to have anything directly to do with the probability that each proposition is true), and suppose we also assign all the propositions probabilities in a way consistent with the laws of probability theory. (I'm assuming here that our class of propositions is closed under the usual logical operations.) And suppose we also assign all the propositions complexities in any reasonable way. Define the *essential complexity* of a proposition to be the infimum of the complexities of propositions that imply it. (I'm pretty sure it's always attained.) Then I conjecture that something like this is both true and fairly easy to prove: for any fixed probability level q, as C -> infinity, if you generate a proposition at random (according to the "generating" distribution) conditional on its essential complexity being at least C, then Pr(its probability >= q) tends to 0.