Notes on Occam via Solomonoff vs. hierarchical Bayes

post by JesseClifton · 2025-02-10T17:55:14.689Z · LW · GW · 7 comments

Contents

  Simplicity via hierarchical Bayes
  Syntax vs. ontology
  References

Crossposted from my Substack.

Intuitively, simpler theories are better all else equal. It also seems like finding a way to justify assigning higher prior probability to simpler theories is one of the more promising ways of approaching the problem of induction. In some places [LW · GW], Solomonoff induction [? · GW] (SI) seems to be considered the ideal way of encoding a bias towards simplicity. (Recall: under SI, hypotheses are programs that spit out observations. A program gets prior probability 2^-CL, where CL is its length in language L.)
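
For concreteness, here is a toy sketch of that weighting (the binary alphabet and the length cutoff are illustrative choices of mine, not part of SI itself, which runs programs on a universal Turing machine):

```python
from itertools import product

# Toy sketch of the length-penalizing prior: weight each bit-string
# "program" by 2^-(its length). Alphabet and cutoff are illustrative only.
MAX_LEN = 10

def length_prior(program: str) -> float:
    """Unnormalized Solomonoff-style prior weight for one program."""
    return 2.0 ** (-len(program))

programs = [
    "".join(bits)
    for n in range(1, MAX_LEN + 1)
    for bits in product("01", repeat=n)
]

# Any individual short program gets exponentially more weight than any
# individual long one:
print(length_prior("01"))          # 0.25
print(length_prior("0110100110"))  # ~0.001

# But the 2^n programs of each length n sum to weight 1, so these raw
# weights are not a normalized prior on their own -- hence the prefix-free
# codes (or limit construction) discussed in the comments below.
print(sum(length_prior(p) for p in programs if len(p) == 5))  # 1.0
```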

But I find SI pretty unsatisfying on its own, and think there might be a better approach (not original to me) to getting a bias towards simpler hypotheses in a Bayesian framework.

Simplicity via hierarchical Bayes

Syntax vs. ontology

References

Builes, David. 2022. “The Ineffability of Induction.” Philosophy and Phenomenological Research 104 (1): 129–49.
Climenhaga, Nevin. 2020. “The Structure of Epistemic Probabilities.” Philosophical Studies 177 (11): 3213–42.
Henderson, Leah. 2014. “Bayesianism and Inference to the Best Explanation.” The British Journal for the Philosophy of Science 65 (4): 687–715.
Huemer, Michael. 2009. “Explanationist Aid for the Theory of Inductive Logic.” The British Journal for the Philosophy of Science 60 (2): 345–75.
———. 2016. “Serious Theories and Skeptical Theories: Why You Are Probably Not a Brain in a Vat.” Philosophical Studies 173 (4): 1031–52.
Rasmussen, Carl, and Zoubin Ghahramani. 2000. “Occam’s Razor.” Advances in Neural Information Processing Systems 13. https://proceedings.neurips.cc/paper/2000/hash/0950ca92a4dcf426067cfd2246bb5ff3-Abstract.html.

7 comments


comment by Lucius Bushnaq (Lblack) · 2025-02-10T18:52:30.095Z · LW(p) · GW(p)

First of all, I’d like to avoid just specifying by fiat that simpler hypotheses get higher prior probability and instead have this be a consequence of more solid principles. I think the principle of indifference is solid, if we can find a privileged parameterization of the hypothesis space to which we can apply the principle.

Solomonoff induction (SI) is already reducible to the principle of indifference. For UTMs with prefix-free codes, SI assigns prior 2^-L to programs of length L.[1] But there's another formulation over plain, non-prefix-free codes, where you just take the uniform prior over all programs of length T, then let T go to infinity. These are equivalent; see this book on pages 145-146.

If this is unintuitive to you, I'd suggest having a look at this [LW(p) · GW(p)] shortform or this [LW(p) · GW(p)] one. The multiplicity of equivalent implementations automatically encodes the exponentially larger prior for simpler programs.
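
To spell out the counting behind that multiplicity (a standard sketch, assuming the machine never reads past the first L bits of a length-T program):

```latex
% A program p whose behavior is fixed by its first L bits (the remaining
% T - L bits are never read) is represented by 2^(T-L) distinct length-T
% strings under the uniform prior over all 2^T plain programs of length T.
\begin{equation}
  \Pr(p)
  = \frac{2^{\,T-L}}{2^{T}}
  = 2^{-L},
\end{equation}
% independent of T, recovering the exponential preference for shorter programs.
```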

I think the only thing left to debate over here is why we should start out assuming that our observations of the world are produced by some program running on some UTM, rather than, say, assigning equal probability to all strings of bits we might observe. In what sense does the universe have structure, such that a-priori bit strings of observations about the universe ought to be treated by us as more than just members of the set of all possible bit strings?

  1. ^

    This is not the same as giving hypothesis h prior 2^-CL(h), as you state near the beginning; afaik that's false [LW · GW].

Replies from: JesseClifton, jchan
comment by JesseClifton · 2025-02-10T19:27:10.360Z · LW(p) · GW(p)

Thanks!

where you just take the uniform prior over all programs of length T, then let T go to infinity

Sure, but because of language-dependence I'm not sure why we would want to apply the principle of indifference at this level. (Note that the quote says "if we can find a privileged parameterization to which we can apply the principle".) I tend to think you should apply the POI at the "explanatorily basic" level (see here, here), which might be the properties of fundamental objects in the ontology (e.g., position and momentum in Newtonian mechanics, maybe?). Otherwise I think you run into unsatisfying-to-me arbitrariness.

In what sense does the universe have structure, such that a-priori bit strings of observations about the universe ought to be treated by us as more than members of the set of all possible bit strings?

Right, I think this is the kind of question you're not going to be able to answer without thinking about the kinds of ontological considerations pointed to here.

afaik that's false

Thanks, will change.

comment by jchan · 2025-02-12T20:09:48.959Z · LW(p) · GW(p)

rather than, say, assigning equal probability to all strings of bits we might observe

If the space of possibilities is not arbitrarily capped at a certain length, then such a distribution would have to favor shorter strings over longer ones in much the same way as the Solomonoff prior over programs (because if it doesn't, then its sum will diverge, etc.). But then this yields a prior that is constantly predicting that the universe will end at every moment, and is continually surprised when it keeps on existing. I'm not sure if this is logically inconsistent, but at least it seems useless for any practical purpose.
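
To make the divergence concrete (a back-of-the-envelope version of the point above):

```latex
% If each of the 2^n bit strings of length n received the same
% length-independent probability c > 0, the total mass would diverge:
\begin{equation}
  \sum_{n=1}^{\infty} 2^{n} \, c \;=\; \infty .
\end{equation}
% Normalization therefore forces the per-string weight c_n to decay fast
% enough that \sum_n 2^n c_n converges (i.e. faster than 2^{-n}), so the
% prior must concentrate its mass on short strings -- the "universe ends
% soon" behavior described above.
```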

comment by quetzal_rainbow · 2025-02-10T21:10:46.488Z · LW(p) · GW(p)

I think there is a reducibility from one to the other using different UTMs. For example, causal networks are Turing-complete, so you can write a UTM that explicitly takes a description of initial conditions and a causal time-evolution law, and every SI-simple hypothesis will then correspond to a simple causal-network hypothesis. And you can find the same correspondence for arbitrary ontologies that allow for Turing-complete computation.

comment by JBlack · 2025-02-13T01:59:05.321Z · LW(p) · GW(p)

One thing that seems worth exploring from a conceptual point of view is doing away with priors altogether, and working more directly with metrics such as "what are the most expected-value rewarding actions that a bounded agent can make given the evidence so far". I suspect that from this point of view it doesn't much matter whether you use a computational basis such as Turing machines, something more abstract, or even something more concrete such as energy required to assemble and run a predicting machine.

From a computing point of view, the simplest models are not necessarily the cheapest to actually run and compare against other models, so in that respect this differs from Solomonoff induction (which assumes a system with infinite computing power). However, my expectation is that there usually would be some model only a bit more complex than the simplest that makes similar predictions at lower cost. A heuristic that examines models in a simplest-first order (but discarding ones that look expensive to evaluate) may well end up being close to optimal in trading off prediction accuracy against whatever costs there are of evaluating multiple models. There are exponentially fewer simple models to evaluate.
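
A minimal sketch of that kind of heuristic (the complexity, eval_cost, and score functions are hypothetical placeholders, not a concrete proposal):

```python
from typing import Callable, Iterable, Tuple

def simplest_first_search(
    models: Iterable[object],
    complexity: Callable[[object], float],
    eval_cost: Callable[[object], float],
    score: Callable[[object], float],
    cost_budget: float,
) -> Tuple[object, float]:
    """Examine candidate models in simplest-first order, skipping any whose
    estimated evaluation cost exceeds the remaining budget, and return the
    best-scoring affordable model seen."""
    best_model, best_score = None, float("-inf")
    remaining = cost_budget
    for model in sorted(models, key=complexity):
        cost = eval_cost(model)
        if cost > remaining:
            continue  # looks too expensive to evaluate; skip it
        remaining -= cost
        s = score(model)  # e.g. predictive accuracy on the evidence so far
        if s > best_score:
            best_model, best_score = model, s
    return best_model, best_score
```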

Replies from: antimonyanthony
comment by Anthony DiGiovanni (antimonyanthony) · 2025-02-13T18:17:52.836Z · LW(p) · GW(p)

working more directly with metrics such as "what are the most expected-value rewarding actions that a bounded agent can make given the evidence so far"

I'm not sure I exactly understand your argument, but it seems like this doesn't avoid the problem of priors, because what's the distribution w.r.t. which you define "expected-value rewarding"?

Replies from: JBlack
comment by JBlack · 2025-02-14T02:33:16.464Z · LW(p) · GW(p)

It doesn't completely avoid the problem of priors, just the problem of arbitrarily fixing a specific type of update rule on fixed priors such as in Solomonoff induction. You can't afford this if you're a bounded agent, and a Solomonoff inductor can only get away with it since it has not just unbounded resources but actually infinite computational power in any given time period.

A bounded agent needs to be able to evaluate alternative priors, update rules, and heuristics in addition to the evidence and predictions themselves, or it won't even approximate bounded rationality. While this is a more complicated scenario than the Solomonoff updater in some senses, it is philosophically simpler, since we can view it more like a "bootstrap" process and ask what sort of bootstrapping might "generally" do well rather than taking anything as fixed.

I suspect that heuristics that score highly involve "universal" but finite systems (such as Turing machines or other mathematical structures capable of representing their own rules), and a "simple and not too costly" evaluation heuristic (not just simplicity).

There would be "degenerate" distributions of universe rules that would be exceptions, so there is still a problem of describing what sort of distributions I'm thinking of as "degenerate", and naturally this whole statement is too vague to prove anything of the sort, even setting aside that such proofs are famously difficult (and plausibly impossible even when the claim is true).