Notes on Occam via Solomonoff vs. hierarchical Bayes
post by JesseClifton · 2025-02-10T17:55:14.689Z · LW · GW · 7 comments
Crossposted from my Substack.
Intuitively, simpler theories are better all else equal. It also seems like finding a way to justify assigning higher prior probability to simpler theories is one of the more promising ways of approaching the problem of induction. In some places [LW · GW], Solomonoff induction [? · GW] (SI) seems to be considered the ideal way of encoding a bias towards simplicity. (Recall: under SI, hypotheses are programs that spit out observations. A program gets prior probability 2^-CL, where CL is its length in language L.)
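As a minimal illustration (my own toy sketch, not from the post, and not actual SI, which uses a universal prefix machine and is uncomputable), here is the 2^-L weighting over a small prefix-free set of bit-string "programs":

```python
programs = ["0", "10", "110", "1110", "11110"]  # a prefix-free set of codes

def length_prior(program: str) -> float:
    """Weight 2^-L for a program of length L."""
    return 2.0 ** -len(program)

weights = {p: length_prior(p) for p in programs}
print(weights)                # shorter programs get exponentially more mass
print(sum(weights.values()))  # <= 1, by the Kraft inequality for prefix-free codes
```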
But I find SI pretty unsatisfying on its own, and think there might be a better approach (not original to me) to getting a bias towards simpler hypotheses in a Bayesian framework.
Simplicity via hierarchical Bayes
- I’m not sure to what extent we need to directly bake in a bias towards simpler hypotheses in order to reproduce our usual inductive inferences or to capture the intuition that simpler theories tend to be better. Maybe we could at least get a long way with a hierarchically-structured prior, where:
- At the highest level, different theories T specify fundamental ontologies. For example, maybe the fundamental ontology of Ptolemaic astronomy was something like “The Earth is at the center of the universe, and all other bodies move along circles”.
- Each theory T contains many specific, disjoint hypotheses, corresponding to particular “parameter values” for the properties of the fundamental objects. For example, Ptolemaic astronomy as a high-level theory allows for many different planetary orbits.
- More complicated theories are those that contain many specific hypotheses. Complicated theories must spread out prior mass over more hypotheses, and if prior mass is spread evenly over the high-level theories, any individual hypothesis in a complicated theory will get lower prior mass than individual hypotheses contained in simpler theories. I.e.:
- Let h1, h2 be hypotheses in T1, T2 respectively.
- Suppose T1 is simpler than T2. Then, generally we will have P(h1 | T1) > P(h2 | T2), because T2 has to spread out prior mass more thinly than T1.
- If P(T1) = P(T2), then we have P(h1) = P(h1 | T1)*P(T1) > P(h2 | T2)*P(T2) = P(h2).
- This means that we can spread out prior mass evenly over the high-level theories (rather than giving lower prior mass to the complex high-level theories), and still find that the posterior mass of complex hypotheses is lower than that of equally-well-fitting simple hypotheses (see the numeric sketch after this list).
- Again, this way of thinking about the relationship between Bayesianism and simplicity is not original to me. See Henderson (2014) for a discussion in the philosophy of science, and Rasmussen and Ghahramani (2000) for a discussion in the context of Bayesian machine learning. Huemer (2016) and Builes (2022) apply such reasoning to argue against skeptical theories.
- A problem with this view: It’s not clear how to decide what should be a high-level theory. E.g., are Copernican and Ptolemaic astronomy two high-level theories, or are they two sub-theories of the high-level theory that says planets move along circles (but doesn’t fix the behavior of the Sun or Earth)?
- Intuitively, this doesn’t bother me a huge amount. Even if it ends up being underdetermined how to do this, my guess is that reasonable ways of individuating high-level theories will still constrain our inferences a lot. But, maybe not, I haven’t thought about it much.
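A toy numeric sketch of the argument above (my own numbers; the theory names, hypothesis counts, and likelihood value are made up for illustration):

```python
n_hyp = {"T1": 2, "T2": 10}        # T1 is "simpler": it contains fewer specific hypotheses
p_theory = {"T1": 0.5, "T2": 0.5}  # principle of indifference at the theory level

# Prior of one specific hypothesis h contained in theory T: P(h) = P(h | T) * P(T)
p_h = {T: p_theory[T] / n_hyp[T] for T in n_hyp}
print(p_h)  # {'T1': 0.25, 'T2': 0.05}

# If one hypothesis from each theory fits the data equally well (same likelihood),
# posterior odds equal prior odds, so the hypothesis from the simpler theory wins:
likelihood = 0.9
print((p_h["T1"] * likelihood) / (p_h["T2"] * likelihood))  # 5.0
```

The Occam effect here comes entirely from indifference at the theory level plus the fact that the more complex theory has to split its mass over more specific hypotheses.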
Syntax vs. ontology
- SI assigns prior probabilities according to the syntax (in an arbitrary language) used to specify a theory. Setting aside the other problems for SI (e.g., see this post [LW · GW]), I think this is pretty unsatisfactory as an attempt to capture our intuitive preference for simplicity, for a few reasons:
- First of all, I’d like to avoid just specifying by fiat that simpler hypotheses get higher prior probability and instead have this be a consequence of more solid principles. I think the principle of indifference is solid, if we can find a privileged parameterization of the hypothesis space to which we can apply the principle. The approach sketched above is attractive to me in this respect: We can try to apply a principle of indifference* at the level of fundamental ontological commitments, which has the consequence that hypotheses contained in more complex theories get lower prior mass.
- *Of course, if we’re considering infinitely many theories/hypotheses we’re going to run into trouble trying to use the principle of indifference. But I still think this view takes us a long way.
- A commenter points out [LW(p) · GW(p)] that Solomonoff induction can be seen as the application of the principle of indifference, i.e., “where you just take the uniform prior over all programs of length T, then let T go to infinity”. To be clear, my view is that the POI should be used when there is a nonarbitrary partition of the hypothesis space to which it can be applied, and this application of the POI is language-dependent. Whereas, on the hierarchical view, the hope is that the privileged parameterization to which you can apply the POI is something like “properties of the fundamental entities in the theory (e.g., positions and momenta of particles in Newtonian mechanics, maybe?)”. (See Huemer (2009) and Climenhaga (2020) on applying the POI at the “explanatorily basic” level.)
- Second of all, insofar as we do want to directly penalize more complex hypotheses, syntactic simplicity does not seem like the way to go. Surely when we intuit that simple theories are better, we have in mind the simplicity of a theory’s ontology (how many entities it posits, how uniform its laws are, etc). While the syntactic simplicity (in some natural-to-us programming language) of specifying a theory presumably correlates with the kind of simplicity we actually care about, they don’t seem to be the same thing.
- So I would say: If you do want to directly assign prior probabilities to hypotheses according to their simplicity, you should start by looking at what the hypothesis actually says about the world and figure out how to measure the simplicity of that.
- A possible response: Solomonoff induction is already a perfectly rigorous theory, which at least accords with many of our intuitions about epistemology. On the other hand, all this business about ontologies has yet to be formalized, and it’s far from clear that any satisfying formalism exists.
- My reply: This sounds like the streetlight effect. The reason that SI has a nice formalism is that it only looks at an easily-extracted property of a hypothesis (its syntax), and doesn’t attempt to extract the thing we should directly care about: what the hypothesis actually says about the world.
- Moreover, thinking in ontological terms may help make progress on one of the (IMO) serious problems for SI, the apparently arbitrary choice of language. For example, perhaps the best we can do in the end is SI with a language that makes it easy to specify a hypothesis in terms of its ontology.
References
Builes, David. 2022. “The Ineffability of Induction.” Philosophy and Phenomenological Research 104 (1): 129–49.
Climenhaga, Nevin. 2020. “The Structure of Epistemic Probabilities.” Philosophical Studies 177 (11): 3213–42.
Henderson, Leah. 2014. “Bayesianism and Inference to the Best Explanation.” The British Journal for the Philosophy of Science 65 (4): 687–715.
Huemer, Michael. 2009. “Explanationist Aid for the Theory of Inductive Logic.” The British Journal for the Philosophy of Science 60 (2): 345–75.
———. 2016. “Serious Theories and Skeptical Theories: Why You Are Probably Not a Brain in a Vat.” Philosophical Studies 173 (4): 1031–52.
Rasmussen, Carl, and Zoubin Ghahramani. 2000. “Occam’s Razor.” Advances in Neural Information Processing Systems 13. https://proceedings.neurips.cc/paper/2000/hash/0950ca92a4dcf426067cfd2246bb5ff3-Abstract.html.
7 comments
comment by Lucius Bushnaq (Lblack) · 2025-02-10T18:52:30.095Z · LW(p) · GW(p)
First of all, I’d like to avoid just specifying by fiat that simpler hypotheses get higher prior probability and instead have this be a consequence of more solid principles. I think the principle of indifference is solid, if we can find a privileged parameterization of the hypothesis space to which we can apply the principle.
Solomonoff induction (SI) is already reducible to the principle of indifference. For UTMs with prefix-free codes, SI assigns prior 2^-L to programs of length L.[1] But there's another formulation over plain, non-prefix-free codes, where you just take the uniform prior over all programs of length T, then let T go to infinity. These are equivalent. See this book on pages 145-146.
If this is unintuitive to you, I'd suggest having a look at this [LW(p) · GW(p)] shortform or this [LW(p) · GW(p)] one. The multiplicity of equivalent implementations automatically encodes the exponentially larger prior for simpler programs.
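A rough arithmetic sketch of the multiplicity point (my own simplification, assuming a program's behaviour is fixed by its first L bits and unaffected by the remaining padding):

```python
T = 30  # total program length for the "plain code" uniform prior
for L in [5, 10, 20]:
    n_paddings = 2 ** (T - L)             # length-T programs sharing the same first L bits
    effective_mass = n_paddings / 2 ** T  # uniform prior mass collected by that behaviour
    print(L, effective_mass, 2.0 ** -L)   # the last two columns agree: mass = 2^-L
```

Under the uniform prior over all 2^T programs, the behaviour determined by a short program is represented by exponentially many paddings, which reproduces the 2^-L weighting independently of T.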
I think the only thing left to debate over here is why we should start out assuming that our observations of the world are produced by some program running on some UTM, rather than, say, assigning equal probability to all strings of bits we might observe. In what sense does the universe have structure, such that a-priori bit strings of observations about the universe ought to be treated by us as more than just members of the set of all possible bit strings?
↑ comment by JesseClifton · 2025-02-10T19:27:10.360Z · LW(p) · GW(p)
Thanks!
where you just take the uniform prior over all programs of length T, then let T go to infinity
Sure, but because of language-dependence I'm not sure why we would want to apply the principle of indifference at this level. (Note that the quote says "if we can find a privileged parameterization to which we can apply the principle".) I tend to think you should apply the POI at the "explanatorily basic" level (see here, here), which might be the properties of fundamental objects in the ontology (e.g., position and momentum in Newtonian mechanics, maybe?). Otherwise I think you run into unsatisfying-to-me arbitrariness.
In what sense does the universe have structure, such that a-priori bit strings of observations about the universe ought to be treated by us as more than members of the set of all possible bit strings?
Right, I think this is the kind of question you're not going to be able to answer without thinking about the kinds of ontological considerations pointed to here.
afaik that's false
Thanks, will change.
↑ comment by jchan · 2025-02-12T20:09:48.959Z · LW(p) · GW(p)
rather than, say, assigning equal probability to all strings of bits we might observe
If the space of possibilities is not arbitrarily capped at a certain length, then such a distribution would have to favor shorter strings over longer ones in much the same way as the Solomonoff prior over programs (because if it doesn't, then its sum will diverge, etc.). But then this yields a prior that is constantly predicting that the universe will end at every moment, and is continually surprised when it keeps on existing. I'm not sure if this is logically inconsistent, but at least it seems useless for any practical purpose.
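A toy construction (my own, for illustration; the per-step ending probability q is made up) of the kind of length-penalized prior this argument points at:

```python
q = 0.1  # hypothetical per-step probability that the observation stream ends

def prior_of_string(s: str) -> float:
    """P(we observe exactly the finite bit string s and then nothing more)."""
    return ((1 - q) / 2) ** len(s) * q

# Longer strings get exponentially less prior mass, so the total mass over all
# finite strings converges. But conditional on any prefix seen so far, the
# probability that the string ends at the very next step is always q -- the
# prior keeps predicting that the universe is about to end.
print(prior_of_string("0"), prior_of_string("01010101"))
```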
comment by quetzal_rainbow · 2025-02-10T21:10:46.488Z · LW(p) · GW(p)
I think there is a reduction from one to the other using different UTMs? I.e., for example, causal networks are Turing-complete, so you can write a UTM that explicitly takes a description of initial conditions and a causal time-evolution law, and every SI-simple hypothesis will then correspond to a simple causal-network hypothesis. And you can find the same correspondence for arbitrary ontologies that allow for Turing-complete computations.
comment by JBlack · 2025-02-13T01:59:05.321Z · LW(p) · GW(p)
One thing that seems worth exploring from a conceptual point of view is doing away with priors altogether, and working more directly with metrics such as "what are the most expected-value rewarding actions that a bounded agent can make given the evidence so far". I suspect that from this point of view it doesn't much matter whether you use a computational basis such as Turing machines, something more abstract, or even something more concrete such as energy required to assemble and run a predicting machine.
From a computing point of view, not all simple models require few resources to actually follow through and compare against other models, so in that respect this differs from Solomonoff induction (which assumes a system of infinite computing power). However, my expectation is that there would usually be some model only a bit more complex than the simplest that makes similar predictions at lower cost. A heuristic that examines models in simplest-first order (discarding ones that look expensive to evaluate) may well end up close to optimal in trading off prediction accuracy against whatever costs there are of evaluating multiple models. There are exponentially fewer simple models to evaluate.
↑ comment by Anthony DiGiovanni (antimonyanthony) · 2025-02-13T18:17:52.836Z · LW(p) · GW(p)
working more directly with metrics such as "what are the most expected-value rewarding actions that a bounded agent can make given the evidence so far"
I'm not sure I exactly understand your argument, but it seems like this doesn't avoid the problem of priors, because what's the distribution w.r.t. which you define "expected-value rewarding"?
↑ comment by JBlack · 2025-02-14T02:33:16.464Z · LW(p) · GW(p)
It doesn't completely avoid the problem of priors, just the problem of arbitrarily fixing a specific type of update rule on fixed priors such as in Solomonoff induction. You can't afford this if you're a bounded agent, and a Solomonoff inductor can only get away with it since it has not just unbounded resources but actually infinite computational power in any given time period.
A bounded agent needs to be able to evaluate alternative priors, update rules, and heuristics in addition to the evidence and predictions themselves, or it won't even approximate bounded rationality. While this is a more complicated scenario than the Solomonoff updater in some senses, it is philosophically simpler since we can view it more like a "bootstrap" process and ask what sort of bootstrapping might "generally" do well rather than taking anything as fixed.
I suspect that heuristics that score highly involve "universal" but finite systems (such as Turing machines or other mathematical structures capable of representing their own rules), and a "simple and not too costly" evaluation heuristic (not just simplicity).
There would be "degenerate" distributions of universe rules that would be exceptions, so there is still a problem of describing what sort of distributions I'm thinking of as "degenerate". And naturally this whole statement is too vague to prove anything, even setting aside that such proofs are famously difficult (and plausibly impossible even when the claim isn't false).