Non-trivial probability distributions for priors and Occam's razor

post by JoshuaZ · 2011-01-11T03:59:28.947Z · LW · GW · Legacy · 22 comments

Assume we have a countable set of hypotheses described in some formal way, with some prior distribution such that 1) our prior for each hypothesis is non-zero and 2) our formal description system has only a finite number of hypotheses of any fixed length. Then I claim that under just this set of weak constraints, our hypotheses satisfy a condition that informally acts a lot like Occam's razor. In particular, let h(n) be the probability mass assigned to "a hypothesis with description length at least n is correct." (ETA: fixed from earlier statement) Then, as n goes to infinity, h(n) goes to zero. So, viewed at a large scale, complicated hypotheses must have low probability. This suggests that one doesn't need any appeal to computability or anything similar to accept some form of Occam's razor. One only needs a countable hypothesis space, no hypothesis with probability zero or one, and a non-stupid way of writing down hypotheses.
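
The step from the two constraints to the limit can be spelled out as follows. This is only a sketch, and it additionally assumes the hypotheses are mutually exclusive, so that their prior masses sum to at most 1 (an assumption made explicit in the comments rather than in the post). The symbols \ell(H) for description length and m_k for the total mass at length k are introduced here for illustration.

```latex
% m_k: total prior mass on hypotheses of description length k.
% Each m_k is a finite sum, by constraint 2.
m_k = \sum_{H \,:\, \ell(H) = k} P(H), \qquad \sum_{k \ge 1} m_k \le 1 .

% h(n) is the tail of a convergent series of non-negative terms,
% so it must vanish in the limit:
h(n) = \sum_{k \ge n} m_k = \sum_{k \ge 1} m_k - \sum_{k < n} m_k
\;\longrightarrow\; 0 \quad \text{as } n \to \infty .
```

Note that nothing in this argument says how fast h(n) shrinks, only that it does.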

A few questions: 1) Am I correct in seeing this as Occam-like or is this just an indication that I'm using too weak a notion of Occam's razor?   

2) Is this point novel? I'm not as familiar with the Bayesian literature as other people here so I'm hoping that someone can point out if this point has been made before.

ETA: This was apparently a point made by Unknowns in an earlier thread which I totally forgot but probably read at the time. Thanks also for the other pointers.

 

22 comments

Comments sorted by top scores.

comment by Zack_M_Davis · 2011-01-11T04:49:36.578Z · LW(p) · GW(p)

Is this point novel?

Compare Rob Zahra's April 2009 comment.

Replies from: cousin_it
comment by cousin_it · 2011-01-11T05:29:01.654Z · LW(p) · GW(p)

Thanks, that's a very good overview that I haven't seen before.

comment by Unnamed · 2011-01-11T05:22:17.693Z · LW(p) · GW(p)

A Proof of Occam's Razor

Replies from: JoshuaZ
comment by JoshuaZ · 2011-01-11T05:26:34.816Z · LW(p) · GW(p)

Oh doh! I even posted in that thread. I have zero recall of reading that but I see comments made by me in the thread so I must have read it...gah. Now I feel silly.

comment by Manfred · 2011-01-11T17:21:38.452Z · LW(p) · GW(p)

There is a problem - length isn't the only way to order hypotheses. If there were some other method of ordering that fulfilled the right conditions (even just some obvious function of length, e.g. length * char value of the first letter), it could be used instead to order hypotheses, which would then be decreasing in probability on that parameter instead. You could give bonus points for mentioning the Flying Spaghetti Monster if you wanted, and it would be a perfectly fine ordering.

Replies from: Sniffnoy
comment by Sniffnoy · 2011-01-11T22:30:06.336Z · LW(p) · GW(p)

That doesn't in any way contradict this, that just demonstrates the fact that limits of sequences are invariant under a permutation of that sequence.
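
A small numerical sketch of the permutation point, under assumptions that are not in the thread: a toy prior of 2^-(i+1) on hypothesis i, truncated to 60 hypotheses and renormalised, and a second ordering that reverses every block of five. The names `tail_masses`, `by_length` and `reshuffled` are hypothetical.

```python
from fractions import Fraction

N = 60  # truncation point; the real hypothesis space is countably infinite

# Assumed toy prior: hypothesis i gets mass 2**-(i+1), renormalised over the truncation.
raw = [Fraction(1, 2 ** (i + 1)) for i in range(N)]
total = sum(raw)
prior = [p / total for p in raw]

def tail_masses(order):
    """Mass remaining after the first n hypotheses in `order`, for n = 0..N."""
    remaining = Fraction(1)
    tails = [remaining]
    for i in order:
        remaining -= prior[i]
        tails.append(remaining)
    return tails

by_length = list(range(N))
# A different ordering of the same hypotheses: reverse every block of five indices.
reshuffled = [i for start in range(0, N, 5) for i in reversed(range(start, start + 5))]

tails_a = tail_masses(by_length)
tails_b = tail_masses(reshuffled)
for n in (0, 5, 10, 20, 40):
    print(n, float(tails_a[n]), float(tails_b[n]))
```

Both tail sequences shrink toward zero. The reshuffled one lags inside each block and catches up at every block boundary, which is the sense in which the limit does not single out one ordering. Because the space is truncated here, this only illustrates the behaviour and does not prove anything about the infinite case.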

Replies from: Manfred
comment by Manfred · 2011-01-12T00:58:44.244Z · LW(p) · GW(p)

It quite thoroughly contradicts this, since it means that this isn't a proof of Occam's razor, merely a proof that there is some sequence.

Replies from: Sniffnoy
comment by Sniffnoy · 2011-01-12T01:41:18.000Z · LW(p) · GW(p)

Yes, it demonstrates that this statement is too weak to be really called "Occam's razor". But it doesn't contradict the statement. Is this standard, to use "contradict" to mean "contradict the intended connotation of"? That seems confusing.

Replies from: jsteinhardt
comment by jsteinhardt · 2011-01-13T04:56:45.179Z · LW(p) · GW(p)

I feel like at the very least, a demonstration that the argument, while technically correct, doesn't achieve the real-world implications it intended to should be taken as an invalidation of the point. I find that it is far too often that someone clings to the fact that their argument is technically correct even though it is irrelevant practically.

Not that anyone is doing that in this thread, as far as I can tell. But it's something to watch out for.

comment by jimrandomh · 2011-01-11T14:12:28.383Z · LW(p) · GW(p)

This is certainly an interesting line of investigation, and as far as I know it's still unsolved. There are still some things that need to be proven to get to an Occam-like distribution, which I mentioned the last time this came up.

Specifically, I haven't seen any good justification for the assumption about there being a finite number of correct hypotheses (or any variant on this assumption; there's lots to choose from), since there could be hypothesis-schemas that generate infinite numbers of true hypotheses. A full analysis of this question would probably have to account for those, in a way that assigns all the hypotheses from a particular schema the complexity of the schema. I also haven't seen anyone argue from "prior probabilities decrease monotonically" to "prior probabilities decrease exponentially"; it seems natural since the size of the space of hypotheses of length n increases exponentially with n, but I don't know how to prove that they don't instead decrease faster or slower than exponentially, or what other assumptions may be necessary.
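
One partial observation, offered as a sketch and not as the proof asked for above: if there are 2^n hypotheses of length n and they are mutually exclusive, the total mass at length n is at most 1, so the average prior of a length-n hypothesis is at most 2^-n. That bounds only the average, not any individual hypothesis, and says nothing about monotonicity. The binary alphabet and the two mass profiles below are assumptions chosen for illustration.

```python
from fractions import Fraction

ALPHABET_SIZE = 2  # assumption: hypotheses are binary strings

def geometric_mass(n: int) -> Fraction:
    """Length-class mass m_n = 2**-n (sums to 1 over n >= 1)."""
    return Fraction(1, 2 ** n)

def slow_mass(n: int) -> Fraction:
    """Length-class mass m_n = 1/(n(n+1)) (also sums to 1 over n >= 1)."""
    return Fraction(1, n * (n + 1))

def mean_prior_at_length(class_mass, n: int) -> Fraction:
    """Average prior of a single hypothesis of length n."""
    return class_mass(n) / ALPHABET_SIZE ** n

for n in (1, 5, 10, 20):
    bound = Fraction(1, ALPHABET_SIZE ** n)
    print(n,
          float(mean_prior_at_length(geometric_mass, n)),
          float(mean_prior_at_length(slow_mass, n)),
          float(bound))
```

Even when the class masses fall off only polynomially, as in `slow_mass`, the per-hypothesis average still sits under the exponential bound, because the number of hypotheses sharing each class grows exponentially.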

Replies from: jsteinhardt, benelliott
comment by jsteinhardt · 2011-01-13T04:58:27.683Z · LW(p) · GW(p)

You don't need finite, only countable. Assuming that our conception of the universe is approximately correct, we are only capable of generating a countable set of hypotheses.

Replies from: jimrandomh
comment by jimrandomh · 2011-01-13T12:42:20.972Z · LW(p) · GW(p)

Huh? There are only countably many statements in any string-based language, so this includes decidedly non-Occamian scenarios like every statement being true, or every statement being true with the same probability.

Replies from: jsteinhardt
comment by jsteinhardt · 2011-01-14T06:34:26.548Z · LW(p) · GW(p)

I'm confused. What is the countable set of hypotheses you are considering? My claim is merely that if you have hypotheses H1, H2, ..., then p(Hi) > 1/n for at most n-1 values of i. This can be thought of as a weak form of Occam's razor.

In what sense is "every statement being true" a choice of a countable set of hypotheses?

I think maybe the issue is that we are using hypothesis in a different sense. In my case a hypothesis is a complete model of the world, so it is not possible for multiple hypotheses to be true. You can marginalize out / observe a bunch of variables to talk about a subset of the world, but your hypotheses should still be mutually exclusive.

comment by benelliott · 2011-01-11T18:22:55.200Z · LW(p) · GW(p)

I think the hypotheses are assumed to be mutually exclusive. For example you could have a long list of possible sets of laws of physics, at most one is true of this universe.

Replies from: jimrandomh
comment by jimrandomh · 2011-01-11T18:41:32.149Z · LW(p) · GW(p)

Right, that's another way of stating the same assumption. But we usually apply Occam's razor to statements in languages that admit sets of non-mutually-exclusive hypotheses of infinite size. So you'd need to somehow collapse or deduplicate those in a way that makes them finite.

Replies from: benelliott
comment by benelliott · 2011-01-11T19:00:21.102Z · LW(p) · GW(p)

I find I mostly apply Occam's razor to mutually exclusive hypotheses, e.g. explanation A of phenomenon X is better than explanation B because it is simpler.

comment by jsteinhardt · 2011-01-11T05:00:31.815Z · LW(p) · GW(p)

People are definitely aware of this in the literature, but still, congratulations on re-deriving it (and consider reading more of the literature if this stuff interests you). Another version: at most n hypotheses can have probability greater than 1/n. This doesn't even reference description length.

Replies from: benelliott
comment by benelliott · 2011-01-11T18:17:39.679Z · LW(p) · GW(p)

Actually, at most (n-1) hypotheses can have probability greater than 1/n.
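
The reason is that n mutually exclusive hypotheses, each with probability above 1/n, would sum to more than 1. A quick check of the bound on two toy priors follows; the priors and the helper name `count_above` are assumptions made for illustration.

```python
from fractions import Fraction

def count_above(prior, n: int) -> int:
    """Number of hypotheses whose prior is strictly greater than 1/n."""
    threshold = Fraction(1, n)
    return sum(1 for p in prior if p > threshold)

# Two toy priors over mutually exclusive hypotheses, each summing to 1.
uniform4 = [Fraction(1, 4)] * 4
geometric = [Fraction(1, 2 ** (i + 1)) for i in range(20)] + [Fraction(1, 2 ** 20)]

for n in range(2, 51):
    assert count_above(uniform4, n) <= n - 1
    assert count_above(geometric, n) <= n - 1

# The bound is tight: four hypotheses at 1/4 each all exceed 1/5.
print(count_above(uniform4, 5))  # prints 4
```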

comment by Matt_Simpson · 2011-01-11T23:38:30.199Z · LW(p) · GW(p)

I vaguely remember a similar point being made by Jaynes in PT:TLoS

comment by Perplexed · 2011-01-11T16:25:01.181Z · LW(p) · GW(p)

I must be missing something. Your argument makes no sense at all to me. It is very simple to construct a countable set of hypotheses of various lengths, all of which have prior probability exactly 1/2. The hypotheses denote events in a universe consisting of an infinite sequence of coin flips, for example.

Furthermore, I must be misreading your definition of h(n), because as I read it, h(n) goes to 1 as n goes to infinity. I.e. it becomes overwhelmingly likely that at least one shorter-than-n hypothesis is correct.

Replies from: JoshuaZ, Manfred
comment by JoshuaZ · 2011-01-11T19:10:31.503Z · LW(p) · GW(p)

I must be missing something. Your argument makes no sense at all to me. It is very simple to construct a countable set of hypotheses of various lengths, all of which have prior probability exactly 1/2. The hypotheses denote events in a universe consisting of an infinite sequence of coin flips, for example.

Hypotheses in this sense are exclusive. If you prefer, consider hypotheses to be descriptors of all possible data one will get from some infinite string of bits.

Furthermore, I must be misreading your definition of h(n), because as I read it, h(n) goes to 1 as n goes to infinity. I.e. it becomes overwhelmingly likely that at least one shorter-than-n hypothesis is correct.

Sorry, yes, h(n) should be for statements with length at least n not at most n.

comment by Manfred · 2011-01-11T17:14:53.723Z · LW(p) · GW(p)

Insert "mutually exclusive" where appropriate :D

Yes, the argument is confused, but I think that's only the writing, not the idea. I think this may not be as general as it could be, though - it would be nice if Occam's razor applied for other conditions.

Oh, wait - it doesn't quite work. I should probably write my own reply for that bit.