How is Solomonoff induction calculated in practice?
post by Bucky · 2019-06-04T10:11:37.310Z · LW · GW · 6 comments
This is a question post.
Contents
Answers 23 Richard_Kennaway 1 shminux None 6 comments
Solomonoff induction is generally given as the correct way to penalise more complex hypotheses when calculating priors. A great introduction can be found here [LW · GW].
My question is, how is this actually calculated in practice?
As an example, say I have 2 hypotheses:
A. The probability distribution of the output is given by the same normal distribution for all inputs, with mean μ and standard deviation σ.
B. The probability distribution of the output is given by a normal distribution depending on an input x, with a mean that is a function of x and standard deviation σ.
It is clear that hypothesis B is more complex (it uses an additional input, has an additional parameter, and requires 2 additional operations to calculate), but how does one calculate the actual penalty that B should be given vs A?
Answers
The short answer is, you can't. Solomonoff induction is not computable, and moreover depends on your model of computation. Asymptotically, the model of computation makes no difference, but for any finite set of examples, there is (for example) a universal Turing machine with short codes for those examples built in.
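For reference, the two standard textbook statements behind this answer can be written as below. They are not quoted from the thread. Here $U$ and $V$ are assumed to be universal machines of the appropriate kind (monotone for $M$, prefix for $K$), $|p|$ is the length of program $p$ in bits, $x\ast$ means "any output that begins with $x$", and $c_{UV}$ is a constant that depends on the two machines but not on $x$.

```latex
M_U(x) \;=\; \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}
\qquad\qquad
K_U(x) \;\le\; K_V(x) + c_{UV}
```

The sum on the left ranges over all programs whose output starts with $x$, which is what makes it uncomputable. The inequality on the right is the invariance theorem: the constant $c_{UV}$ becomes negligible as the data grows, but for any finite set of examples it can be large enough to let a particular machine favour them.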
Practical methods of choosing among models use completely different methods. One collection of such methods is called regularization.
↑ comment by Bucky · 2019-06-05T21:13:15.838Z · LW(p) · GW(p)
Well that explains why I was struggling to find anything online!
Thanks for the link, I’ve been going through some of the techniques.
Using AIC the penalty for each additional parameter is a factor of e. For BIC the equivalent is √n (n being the number of samples), so the more samples the more penalised a complex model is. For large n the two diverge - are there principled methods for choosing which regularisation to use?
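To make the two penalties concrete, here is a minimal sketch that fits hypotheses A and B from the question by maximum likelihood and scores them with AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. Since exp(-AIC/2) = L·e^(-k) and exp(-BIC/2) = L·n^(-k/2), each extra parameter costs a likelihood factor of e under AIC and √n under BIC. The linear form of the mean in B, the synthetic data, and all function names are assumptions made for illustration.

```python
import math
import random


def gaussian_log_likelihood(residuals):
    """Maximised log-likelihood of i.i.d. normal residuals (sigma fitted by MLE)."""
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def fit_a(ys):
    """Hypothesis A: constant mean. Parameters: mu, sigma (k = 2)."""
    mu = sum(ys) / len(ys)
    return gaussian_log_likelihood([y - mu for y in ys]), 2


def fit_b(xs, ys):
    """Hypothesis B, assuming mean = c + m * x. Parameters: c, m, sigma (k = 3)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = y_bar - m * x_bar
    residuals = [y - (c + m * x) for x, y in zip(xs, ys)]
    return gaussian_log_likelihood(residuals), 3


def aic(log_lik, k):
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik


if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [rng.gauss(5.0, 1.0) for _ in xs]  # synthetic data generated under A
    ll_a, k_a = fit_a(ys)
    ll_b, k_b = fit_b(xs, ys)
    print("AIC  A:", aic(ll_a, k_a), " B:", aic(ll_b, k_b))
    print("BIC  A:", bic(ll_a, k_a, n), " B:", bic(ll_b, k_b, n))
    # Per-parameter penalty expressed as a likelihood factor.
    print("AIC factor:", math.e, " BIC factor:", math.sqrt(n))
```

Because A is nested inside B, B's maximised likelihood is never lower than A's; the criteria differ only in how large an improvement they require before preferring B.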
↑ comment by Richard_Kennaway · 2019-06-07T16:37:04.415Z · LW(p) · GW(p)
At this point I reveal that I just play a statistician on the net. I don't know how people choose from among the many methods available. Is there a statistician in the house?
↑ comment by Gurkenglas · 2019-06-13T12:03:28.490Z · LW(p) · GW(p)
Solomonoff induction is not computable because its hypothesis space is infinite, but Bucky is only asking about a finite subset.
↑ comment by Richard_Kennaway · 2019-06-13T17:46:56.864Z · LW(p) · GW(p)
Solomonoff induction is uncomputable because it requires knowing the set of all programs that compute a given set of data. If you just have two hypotheses in front of you, "Solomonoff induction" is not quite the term, as strictly understood it is a method for extrapolating a given sequence of data, rather than choosing between two programs that would generate the data seen so far. But understanding it as referring to the general idea of assigning probabilities to programs by their length, these are still uncomputable if one considers only programs that are of minimal length in their equivalence class. And when you don't make that requirement, the concepts of algorithmic complexity have little to say about the example.
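The "general idea of assigning probabilities to programs by their length" can be sketched for a finite set of hypotheses as follows. This is a toy, not Solomonoff induction: the description lengths are made-up numbers that would change with the encoding or programming language chosen (the dependence on the model of computation mentioned in the answer), and nothing here checks that a description is of minimal length.

```python
import math


def length_prior(lengths_bits):
    """Normalised weights proportional to 2**(-length) over a finite hypothesis set."""
    # Subtract the shortest length first so the powers of two stay well scaled.
    shortest = min(lengths_bits.values())
    weights = {h: 2.0 ** -(bits - shortest) for h, bits in lengths_bits.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def posterior(prior, log_likelihoods):
    """Bayes update of a finite prior using per-hypothesis log-likelihoods."""
    top = max(log_likelihoods.values())
    unnorm = {h: prior[h] * math.exp(log_likelihoods[h] - top) for h in prior}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}


if __name__ == "__main__":
    # Hypothetical description lengths in bits; purely illustrative.
    lengths = {"A": 40, "B": 48}
    prior = length_prior(lengths)
    print(prior)  # B starts out penalised by a factor of 2**8 relative to A
    # Hypothetical log-likelihoods of the observed data under each hypothesis.
    print(posterior(prior, {"A": -310.0, "B": -300.0}))
```

With these invented numbers the 8-bit penalty on B is outweighed by its better fit; a different encoding could shift the lengths and reverse the prior ordering, which is why the finite case does not give a canonical answer.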
In practice no one honestly computes the complexity of models in any of the fields I am familiar with, such as physics, bio, chem etc. Sometimes they count the number of parameters/degrees of freedom, like in your example. In reality there is a dearth of models that explain observations and predict something new, interesting and testable, so the issue rarely arises.
↑ comment by Charlie Steiner · 2019-06-05T11:53:27.708Z · LW(p) · GW(p)
Minimum message length fitting uses an approximation of K-complexity and gets used sometimes when people want to fit weird curves in a sort of principled way. But "real" Solomonoff induction is about feeding literally all of your sensory data into the algorithm to get predictions for the future, not just fitting curves.
So I guess I'd say that it's possible to approximate K-complexity and use that in your prior for curve fitting, and people sometimes do that. But that's not necessarily going to be your best estimate, because your best estimate is going to take into account all of the data you've already seen, which becomes impossible very quickly (even if you just want a controlled approximation).
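One crude way to see what "approximating K-complexity" can look like is to use the length of a compressed encoding as a stand-in. Note the swap: this is general-purpose compression with Python's `zlib`, not minimum message length fitting, and it only gives an upper-bound-style proxy (up to an additive constant for the decompressor). The function name and the test strings are illustrative assumptions.

```python
import random
import zlib


def compressed_bits(data: bytes) -> int:
    """Bits in the zlib-compressed data: a rough proxy for description length."""
    return 8 * len(zlib.compress(data, 9))


if __name__ == "__main__":
    rng = random.Random(0)
    regular = b"01" * 500  # highly regular, so it compresses well
    noisy = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random bytes
    print("regular:", compressed_bits(regular), "bits")
    print("noisy:  ", compressed_bits(noisy), "bits")
```

The regular string gets a far shorter code than the noisy one of the same length, which is the qualitative behaviour one wants from a complexity estimate, even though the actual numbers depend entirely on the compressor.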
6 comments
comment by Pattern · 2019-06-04T18:44:34.362Z · LW(p) · GW(p)
First of all, this sounds weird - Solomonoff induction isn't about scoring two hypotheses. It includes something which does (something like) that, and then it does that for all possible programs/hypotheses (which is one reason why people are saying it's uncomputable*) and then it has a universal prior, and as it gets new evidence it updates that (infinite) prior probability distribution (which is the second reason why it's uncomputable*).
*a.k.a.: this takes literally forever.
↑ comment by Pattern · 2019-06-04T18:48:53.292Z · LW(p) · GW(p)
AIXI is related. Being based on Solomonoff induction, it is also incomputable. However, there's AIXItl, which is an approximation (with memory and time constraints), and it's okay at something like first-person Pac-Man, so there probably is an approximation to Solomonoff induction. I don't know how useful it is, and I've never seen it used.
↑ comment by Charlie Steiner · 2019-06-05T11:44:54.768Z · LW(p) · GW(p)
Sure. The OP might more accurately have asked "How is the Solomonoff prior calculated?"
comment by avturchin · 2019-06-04T19:25:28.312Z · LW(p) · GW(p)
I was also interested in this question, as it implies different answers about the nature of Occam's razor. If the probability (as a function of complexity) diminishes quickly, the simplest outcome is the most probable. However, if this function has a fat tail, its median could be somewhere halfway to infinite complexity, which means that most true theories are incredibly complex.
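The contrast between a quickly diminishing prior and a fat-tailed one can be illustrated with a small sketch. Both weight functions are arbitrary choices for illustration: 2^(-k) stands in for a fast-decaying prior over complexity k, and 1/k for a fat-tailed one, which can only be normalised with a cutoff `max_k` (itself an assumption of this sketch).

```python
def median_complexity(weight, max_k):
    """Median of a prior over complexities 1..max_k with unnormalised weight(k)."""
    weights = [weight(k) for k in range(1, max_k + 1)]
    half = sum(weights) / 2
    running = 0.0
    for k, w in enumerate(weights, start=1):
        running += w
        if running >= half:
            return k
    return max_k


if __name__ == "__main__":
    for max_k in (10**2, 10**3, 10**4, 10**5):
        thin = median_complexity(lambda k: 2.0 ** -k, max_k)
        fat = median_complexity(lambda k: 1.0 / k, max_k)
        print(f"cutoff={max_k:>6}  median(2^-k)={thin}  median(1/k)={fat}")
```

Under the geometric weight the median stays at the simplest hypothesis regardless of the cutoff, while under the 1/k weight it keeps moving outward as the cutoff grows, which is the sense in which a fat tail pushes most of the probability toward complex theories.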
↑ comment by Gurkenglas · 2019-06-13T12:24:35.637Z · LW(p) · GW(p)
How do we choose the correct version of Occam's razor to use? As always, we use Occam's razor to give prior probabilities to each possibility (each version of Occam's razor), then update using real-world observations. There's a problem of circularity here, of course. I think that the version that humans intuitively use lies in a large region of the space of versions such that if you use one version from the region to choose a new version, and repeat this self-reflection, the process converges.