Against Occam's Razor
post by zulupineapple · 2018-04-05T17:59:27.583Z · LW · GW · 20 comments
Why should the Occamian prior work so well in the real world? It's a seemingly profound mystery that is asking to be dissolved.
To begin with, I propose a Lazy Razor and a corresponding Lazy prior:
Given several competing models of reality, we should select the one that is easiest to work with.
This is merely a formulation of the obvious trade-off between accuracy and cost. I would rather have a bad prediction today than a good prediction tomorrow or a great prediction ten years from now. Ultimately, this prior will deliver a good model, because it will let you try out many different models fast.
The concept of "easiness" may seem even more vague than "complexity", but I believe that in any specific context its measurement should be clear. Note that "easiness" is measured in man-hours, dollars, and the like; it's not to be confused with "hardness" in the sense of P and NP. If you still don't know how to measure "easiness" in your context, you should use the Lazy prior to choose an "easiness" measurement procedure. To break the recursive loop, know that the Laziest of all models is called "pulling numbers out of your ass".
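As a minimal sketch of what selection under the Lazy prior could look like (the model names and man-hour figures below are purely hypothetical illustrations):

```python
# A toy sketch of the Lazy Razor: estimate how costly each candidate model is
# to work with (here in man-hours; the names and numbers are made up) and
# start with the cheapest one.

candidate_models = {
    "pull_number_out_of_ass": 0.1,    # estimated man-hours to produce a prediction
    "two_parameter_fit": 2.0,
    "billion_parameter_model": 500.0,
}

def lazy_choice(models):
    """Return the model that is cheapest to try, per the Lazy Razor."""
    return min(models, key=models.get)

print(lazy_choice(candidate_models))  # -> "pull_number_out_of_ass"
```

If the cheapest model turns out to be too inaccurate, you discard it and move on to the next cheapest, which is exactly the "try out many different models fast" dynamic described above.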
Now let's return to the first question. Why should the Occamian prior work so well in the real world?
The answer is: it doesn't, not really. Of all the possible priors, the Occamian prior holds no special place. Its greatest merit is that it often resembles the Lazy prior in the probabilities it offers. Indeed, it is easy to see that a random model with a billion parameters is disliked by both priors, and that a model with two parameters is loved by both. By the way, its second greatest merit is being easy to work with.
Note, the priors are not interchangeable. One case where they disagree is on making use of existing resources. Suppose mathematics has derived powerful tools for working with A-theory but not B-theory. Then Lazy prior would suggest that a complex model based on A-theory may be preferable to a simpler one based on B-theory. Or, suppose some process took millions of years to produce abundant and powerful meat-based computers. Then Lazy prior would suggest that we make use of them in our models, regardless of their complexity, while the Occamian prior would object.
Comments sorted by top scores.
comment by Qiaochu_Yuan · 2018-04-13T17:13:10.286Z · LW(p) · GW(p)
I mentioned to Eliezer once a few years ago that a weak form of Occam's razor, across a countable hypothesis space, is inevitable. This observation was new to him so it seems worth reproducing here.
Suppose $p$ is any prior whatsoever over a countable hypothesis space, for example the space of Turing machines. The condition that $\sum_i p_i = 1$, and in particular the condition that this sum converges, implies that for every $\varepsilon > 0$ we can find $N$ such that $\sum_{i > N} p_i < \varepsilon$; in other words, the total probability mass of hypotheses with sufficiently large indices gets arbitrarily small. If the indices index hypotheses by increasing complexity, this implies that the total probability mass of sufficiently complicated hypotheses gets arbitrarily small, no matter what the prior is.
The real kicker is that "complexity" can mean absolutely anything in the argument above; that is, the indexing can be arbitrary and the argument will still apply. And it sort of doesn't matter; any indexing has the property that it will eventually exhaust all of the sufficiently "simple" hypotheses, according to any other definition of "simplicity," because there aren't enough "simple" hypotheses to go around, and so must eventually have the property that the hypotheses being indexed get more and more "complicated," whatever that means.
So, roughly speaking, weak forms of Occam's razor are inevitable because there just aren't as many "simple" hypotheses as "complicated" ones, whatever "simple" and "complicated" mean, so "complicated" hypotheses just can't have that much probability mass individually. (And in turn the asymmetry between simple and complicated is that simplicity is bounded but complexity isn't.)
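Stated compactly (my notation, with $p_i$ the prior mass on the $i$-th hypothesis in some enumeration):

```latex
% Tail-mass form of the observation: any convergent prior over a countable
% space pushes almost all of its mass onto an initial segment of hypotheses.
\[
\sum_{i=1}^{\infty} p_i = 1
\quad\Longrightarrow\quad
\forall \varepsilon > 0 \;\; \exists N \;\; \sum_{i > N} p_i < \varepsilon .
\]
% Whatever the enumeration means by "complexity", all hypotheses beyond
% index N jointly carry less than epsilon of the probability mass.
```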
There's also an anthropic argument for stronger forms of Occam's razor that I think was featured in a recentish post: worlds in which Occam's razor doesn't work are worlds in which intelligent life probably couldn't have evolved.
↑ comment by zulupineapple · 2018-04-14T08:24:08.752Z · LW(p) · GW(p)
worlds in which Occam's razor doesn't work are worlds in which intelligent life probably couldn't have evolved.
Can you elaborate or share a link? I would be suspicious of such an argument. To the contrary, I'd say, humans are very complex and also model the world very well. If there could exist simpler models of comparable accuracy, then human complexity would not have evolved.
a weak form of Occam's razor, across a countable hypothesis space, is inevitable.
Absolutely true, but it's a bit of a technicality. Remember that the number of hypotheses you could write down in the lifetime of the universe is finite. A prior on those hypotheses can be arbitrarily un-Occamian.
comment by gwern · 2018-04-05T18:09:00.477Z · LW(p) · GW(p)
Doesn't the speed prior diverge quite rapidly from the universal prior? There are many short programs of length _n_ which take a long time to compute their final result - up to BB(_n_) timesteps, specifically...
↑ comment by zulupineapple · 2018-04-05T19:38:38.679Z · LW(p) · GW(p)
Yes, the two priors aren't as close as I might have implied. But still there are many cases where they agree. For example, given a random 6-state TM and a random 7-state TM, both Lazy and Occamian priors will usually prefer the 6-state machine.
By the way, if I had to simulate these TMs by hand, I would care a lot about computation time. But now that we have cheap computers, computation time has a smaller coefficient, and the time it takes to build the TM matters more. That is how it works: "easiness" is measured in man-hours, not just in the number of steps the TM makes.
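A rough numerical sketch of that preference (the encoding scheme and the man-hour cost model below are both hypothetical, chosen only for illustration):

```python
# Toy comparison of a 6-state vs 7-state Turing machine under an Occamian
# weight (2^-description_length) and a made-up "lazy" cost in man-hours.
import math

def description_bits(n_states, n_symbols=2):
    # Rough encoding length of a TM table: each of the n_states * n_symbols
    # entries specifies (next state, write symbol, move direction).
    entries = n_states * n_symbols
    bits_per_entry = math.ceil(math.log2(n_states)) + 1 + 1
    return entries * bits_per_entry

def occam_weight(n_states):
    return 2.0 ** -description_bits(n_states)

def lazy_cost_hours(n_states):
    # Hypothetical: assume building/debugging effort grows with table size.
    return 1.0 * n_states

for n in (6, 7):
    print(n, "states | occam weight:", occam_weight(n),
          "| lazy cost (h):", lazy_cost_hours(n))
# Both measures favor the 6-state machine: higher Occamian weight, lower cost.
```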
comment by Dagon · 2018-04-05T19:20:35.236Z · LW(p) · GW(p)
How do you define "works well" for a prior? I argue that for most things, the universal prior (everything is equally likely) works about as well as the lazy prior or Occam's prior, because _all_ non-extreme priors are overwhelmed with evidence (including evidence from other agents) very rapidly. All three fail in the tails, but do just fine for the majority of uses.
Now if you talk about a measure of model simplicity and likelihood to apply to novel situations, rather than probability of prediction, then it's not clear that universal is usable, but it's also not clear that lazy is better or worse than Occam.
↑ comment by zulupineapple · 2018-04-05T20:03:34.356Z · LW(p) · GW(p)
By "works well" I mean that we find whatever model we were looking for. Note that I didn't say "eventually" (all priors work "eventually", unless they assign too many 0 probabilities).
↑ comment by TAG · 2018-04-13T13:50:06.382Z · LW(p) · GW(p)
That seems susceptible to circularity. If we are looking for a simple model, we will get one. But what if we are looking for a true model? Is the simplest model necessarily true?
↑ comment by zulupineapple · 2018-04-13T16:58:47.597Z · LW(p) · GW(p)
We aren't looking for a simple model, we are looking for a model that generates accurate predictions. For instance, we could have two agents with two different priors independently working on the same problem (e.g. weather forecasting) for a fixed amount of time, and then see which of them found a more accurate model. Then, whoever wins gets to say that his prior is better. Nothing circular about it.
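A sketch of that comparison procedure, with the agents left as placeholder functions (every name and number here is illustrative, not a real benchmark):

```python
# Sketch of the fixed-budget comparison: each "agent" is a stand-in function
# that, given training data and a deadline, searches for a predictive model
# guided by its own prior. Whoever's model scores better within the budget wins.
import time

def evaluate_agent(agent_fit, train, test, budget_seconds):
    """Let the agent search within the budget, then score its model on held-out data."""
    deadline = time.monotonic() + budget_seconds
    model = agent_fit(train, deadline)        # returns a callable x -> prediction
    errors = [abs(model(x) - y) for x, y in test]
    return sum(errors) / len(errors)          # mean absolute error

def compare_priors(agent_a, agent_b, train, test, budget_seconds=3600):
    err_a = evaluate_agent(agent_a, train, test, budget_seconds)
    err_b = evaluate_agent(agent_b, train, test, budget_seconds)
    return "prior A wins" if err_a < err_b else "prior B wins"
```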
comment by gbear605 · 2018-04-05T18:16:33.548Z · LW(p) · GW(p)
"Why should the Occamian prior work so well in the real world?"
A different way of phrasing Occam's Razor is "Given several competing models of reality, the most likely one is the one that involves multiplying together the fewest probabilities." That's because each additional level of complexity is adding another probability that needs to be multiplied together. It's a simple result of probability.
↑ comment by zulupineapple · 2018-04-05T19:44:42.237Z · LW(p) · GW(p)
That's because each additional level of complexity is adding another probability that needs to be multiplied together. It's a simple result of probability.
One problem with that explanation is that it does not reference the current universe at all. It implies that the Occamian prior should work well in any universe where the laws of probability hold. Is that really true? Note, on the other hand, that in the Lazy prior, the measure of "easiness" is very much based on how this universe works and what state it is in.
↑ comment by gbear605 · 2018-04-10T22:40:11.710Z · LW(p) · GW(p)
I believe that the Occamian prior should hold true in any universe where the laws of probability hold. I don't see any reason why not, since the assumption behind it is that all the individual levels of complexity of different models have roughly the same probability.
↑ comment by zulupineapple · 2018-04-11T05:57:01.023Z · LW(p) · GW(p)
Laws of probability say that $P(A \wedge B) = P(A)\,P(B \mid A) \le P(A)$.
I suspect that to you "Occam's Razor" refers to this law (I don't think that's the usual interpretation, but it's reasonable). However this law does not make a prior. It does not say anything about whether we should prefer a 6-state Turing machine to a 100-state TM, when building a model. Try using the laws of probability to decide that.
the Occamian prior should hold true
Priors don't "hold true", that's a type error (or at least bad wording).
↑ comment by scarcegreengrass · 2018-04-05T20:47:39.758Z · LW(p) · GW(p)
It implies that the Occamian prior should work well in any universe where the laws of probability hold. Is that really true?
Just to clarify, are you referring to the differences between classical probability and quantum amplitudes? Or do you mean something else?
↑ comment by zulupineapple · 2018-04-06T06:18:47.744Z · LW(p) · GW(p)
Not at all. I'm repeating a truism: to make a claim about the territory, you should look at the territory. "Occamian prior works well" is an empirical claim about the real world (though it's not easy to measure). "Probabilities need to be multiplied" is a lot less empirical (it's about as empirical as 2+2=4). Therefore the former shouldn't follow from the latter.
comment by daozaich · 2018-04-06T12:57:58.613Z · LW(p) · GW(p)
I have a feeling that you mix probability and decision theory. Given some observations, there are two separate questions when considering possible explanations / models:
1. What probability to assign to each model?
2. Which model to use?
Now, our toy-model of perfect rationality would use some prior, e.g. the bit-counting universal/kolmogorov/occam one, and bayesian update to answer (1), i.e. compute the posterior distribution. Then, it would weight these models by "convenience of working with them", which goes into our expected utility maximization for answering (2), since we only have finite computational resources after all. In many cases we will be willing to work with known wrong-but-pretty-good models like Newtonian gravity, just because they are so much more convenient and good enough.
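A toy sketch of that two-step procedure (all priors, likelihoods and convenience scores below are invented for illustration):

```python
# Toy version of: (1) Bayesian update to get a posterior over models,
# (2) pick the model to *use* by weighting posterior with convenience.

models = {
    "newtonian_gravity":  {"prior": 0.3, "likelihood": 0.70, "convenience": 0.9},
    "general_relativity": {"prior": 0.3, "likelihood": 0.95, "convenience": 0.2},
    "epicycle_model_v17": {"prior": 0.4, "likelihood": 0.10, "convenience": 0.5},
}

# Step 1: posterior over models (answers "what probability to assign?").
evidence = sum(m["prior"] * m["likelihood"] for m in models.values())
posterior = {name: m["prior"] * m["likelihood"] / evidence for name, m in models.items()}

# Step 2: which model to adopt, weighting belief by convenience
# (a stand-in for the full expected-utility calculation).
chosen = max(models, key=lambda name: posterior[name] * models[name]["convenience"])

print(posterior)
print("model to work with:", chosen)  # newtonian_gravity, despite its lower posterior
```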
I have a feeling that you correctly intuit that convenience should enter the question which model to adopt, but misattribute this into the probability-- but which model to adopt should formally be bayesian update + utility maximization (taking convenience and bounded computational resources into account), and definitely not "Bayesian update only", which leads you to the (imho questionable) conclusion that the universal / kolmogorov / occam prior is flawed for computing probability.
On the other hand, you are right that the above toy model of perfect rationality is computationally bad: computing the posterior distribution from some prior and then weighting by utility/convenience is kind of stupid if directly computing prior * convenience is cheaper than computing prior and convenience separately and then multiplying. More generally, probability is a nice concept for human minds to reason about reasoning, but we ultimately care about decision theory only.
Always combining probability and utility might be a more correct model, but it is often conceptually more complex to my mind, which is why I don't try to always adopt it ;)
↑ comment by zulupineapple · 2018-04-06T14:02:12.524Z · LW(p) · GW(p)
You are correct that Lazy prior largely encodes considerations of utility maximization. My core point isn't that Lazy prior is some profound idea.
Instead my core point is that the Occamian prior is not profound either. It has only a few real merits. One minor merit is that it is simple to describe and to reason about, which makes it a high-utility choice of a prior, at least for theoretical discussions.
But the greatest merit of Occamian prior is that it vaguely resembles the Lazy prior. That is, it also encodes some of the same considerations of utility maximization. I'm suggesting that, whenever someone talks about the power of Occam's razor or the mysterious simplicity of nature, what is happening is in fact this: the person did not bother to do proper utility calculations, Occamian prior encoded some of those calculations by construction, and therefore the person managed to reach a high-utility result with less effort.
With that in mind, I asked what prior would serve this purpose even better and arrived at Lazy prior. The idea of encoding these considerations in a prior may seem like an error of some kind, but the choice of a prior is subjective by definition, so it should be fine.
(Thanks for the comment. I found it useful. I hadn't explicitly considered this criticism when I wrote the post, and I feel that I now understand my own view better.)
↑ comment by daozaich · 2018-04-06T15:03:33.669Z · LW(p) · GW(p)
>But the greatest merit of Occamian prior is that it vaguely resembles the Lazy prior.
...
>With that in mind, I asked what prior would serve this purpose even better and arrived at Lazy prior. The idea of encoding these considerations in a prior may seem like an error of some kind, but the choice of a prior is subjective by definition, so it should be fine.
Encoding convenience * probability into some kind of pseudo-prior, such that the expected-utility maximizer is the maximum-likelihood model with respect to the pseudo-prior, does seem like a really useful computational trick, and you are right that terminology should reflect this. And you are right that the Occam prior has the nice property that weight-by-bit-count is often close to convenience, and hence makes the wrong naive approach somewhat acceptable in practice: that is, just taking the maximum-likelihood model with respect to bit-count should often be a good approximation to weight-by-bitcount * convenience (which is the same as using weight-by-bitcount for probability and then maximizing expected utility).
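In symbols, the trick might be written as follows (my notation: $\ell(m)$ a description length, $c(m)$ a convenience score, $D$ the data):

```latex
% The "pseudo-prior" trick: fold convenience into the score being maximized,
% so the expected-utility choice looks like a maximum-likelihood choice.
\[
m^{*}
= \arg\max_{m} \; \underbrace{2^{-\ell(m)}\, P(D \mid m)}_{\text{Occam-weighted posterior (unnormalized)}}
\cdot \underbrace{c(m)}_{\text{convenience}}
= \arg\max_{m} \; \tilde{p}(m \mid D),
\]
% where \tilde{p} is the resulting "Lazy" pseudo-posterior.
```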
In cases where we know the utility we can regenerate probabilities afterwards. So I would now be really interested in some informal study of how well Occam actually performs in practice, after controlling for utility: You are right that the empirical success of Occam might be only due to the implicit inclusion of convenience (succinct-by-bit-count models are often convenient) when doing the (wrong!) max-likelihood inference. I had not considered this, so thanks also for your post; we both learned something today.
I'd also remark/reiterate the point in favor of the Lazy prior: The really terrible parts of working with Occam (short descriptions that are hard to reason about, aka halting problem) get cancelled out in the utility maximization anyway. Lazy avoids invoking the halting-problem oracle in your basement for computing these terms (where we have the main differences between Occam vs Lazy). So you are right after all: Outside of theoretical discussion we should all stop using probabilities and Occam and switch to some kind of Lazy pseudo-prior. Thanks!
That being said, we all appear to agree that Occam is quite nice as an abstract tool, even if somewhat naive in practice.
A different point in favor of Occam is "political objectivity": It is hard to fudge in motivated reasoning. Just like the "naive frequentist" viewpoint sometimes wins over Bayes with respect to avoiding politically charged discussions of priors, Occam defends against "witchcraft appears natural to my mind, and the historical record suggests that humans have evolved hardware acceleration for reasoning about witchcraft; so, considering Lazy-prior, we conclude that witches did it" (Occam + utility maximization rather suggests the more palatable formulation "hence it is useful to frame these natural processes in terms of Moloch, Azatoth and Cthulhu battling it out", which ends up with the same intuitions and models but imho better mental hygiene).
comment by Gordon Seidoh Worley (gworley) · 2018-04-05T18:37:42.711Z · LW(p) · GW(p)
Why should the Occamian prior work so well in the real world? It's a seemingly profound mystery that is asking to be dissolved.
Is it? It seems a rather straightforward consequence of how knowledge works, as in information allows you to establish probabilistic beliefs and then probability theory explains pretty simply what Occam's Razor is.