Computable Universal Prior

post by Ronny Fernandez (ronny-fernandez) · 2015-12-11T09:54:24.935Z · LW · GW · Legacy · 5 comments


Suppose that instead of using 2^-K(H) we just use 2^-length(H). Does this do something obviously stupid?

Here's what I'm proposing:

Take a programming language with two characters. Assign each program a prior of 2^-length(program). If the program outputs some string, then P(string | program) = 1; otherwise it equals 0. I figure there must be some reason people don't do this already, or else there's a bunch of people doing it. I'd be real happy to find out about either.

Clearly, it isn't a probability distribution (there are 2^l strings of each length l, each weighted 2^-l, so the total weight diverges), but we can still use it, no?
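For concreteness, here is a minimal sketch of the proposal in Python. Everything about the interpreter is an assumption for illustration: run_program implements a deliberately trivial prefix-free toy language (a program is 1^k 0 followed by k payload bits, and it outputs the payload), not a universal one. The point is only the bookkeeping: weight each program by 2^-length, keep the ones whose output matches, and renormalize.

```python
from fractions import Fraction

def run_program(bits):
    """Toy prefix-free language: a program is 1^k 0 followed by exactly k
    payload bits, and its output is the payload. Returns None for strings
    that aren't well-formed programs."""
    k = 0
    while k < len(bits) and bits[k] == "1":
        k += 1
    if k >= len(bits) or bits[k] != "0" or len(bits) != 2 * k + 1:
        return None
    return bits[k + 1:]

def posterior(observed, max_len):
    """Weight every program up to max_len by 2^-length, keep those whose
    output is the observed string, and renormalize."""
    weights = {}
    for length in range(1, max_len + 1):
        for n in range(2 ** length):
            prog = format(n, "0{}b".format(length))
            if run_program(prog) == observed:
                weights[prog] = Fraction(1, 2 ** length)
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

print(posterior("01", 7))  # {'11001': Fraction(1, 1)} in this toy language
```

The renormalization step is one way to use the weights even though they don't sum to 1 on their own.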

5 comments

Comments sorted by top scores.

comment by V_V · 2015-12-11T14:54:42.340Z · LW(p) · GW(p)

Take a programming language with two characters. Assign each program a prior of 2^-length(program). If the program outputs some string, then P(string | program) = 1; otherwise it equals 0.

This is exactly Solomonoff induction, assuming that the programming language is prefix-free.

I figure there must be some reason people don't do this already, or else there's a bunch of people doing it. I'd be real happy to find out about either.

Because it's uncomputable: there are infinitely many programs to consider, and some of them don't halt; for many of those, you can't even prove that they don't halt.

If you set a maximum program length and a maximum execution time, then inference becomes computable, albeit combinatorially hard: maximum a posteriori inference can be easily reduced to MAX-SAT or ILP and is therefore NP-complete (optimization version), while posterior evaluation is #P-complete.
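To make "computable, albeit combinatorially hard" concrete, here is the naive enumeration version of bounded MAP inference (not the MAX-SAT/ILP encoding), reusing the toy run_program interpreter from the sketch in the post above:

```python
def map_program(observed, max_len):
    """Brute-force MAP inference: return the shortest (hence highest-prior,
    by 2^-length) program within the length bound whose output matches the
    data. Worst case it enumerates about 2^(max_len + 1) programs, which is
    the combinatorial blow-up in question."""
    for length in range(1, max_len + 1):  # shortest first
        for n in range(2 ** length):
            prog = format(n, "0{}b".format(length))
            if run_program(prog) == observed:
                return prog
    return None

print(map_program("01", 7))  # '11001'
```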

It's known that many NP-hard problems are practically solvable for real-world instances. In this case, there are indeed people (primarily at Microsoft, I think) who managed to make it work, but only for short program lengths in limited application domains: Survey paper.

Replies from: solipsist
comment by solipsist · 2015-12-12T01:06:29.495Z · LW(p) · GW(p)

Spitballing hacks around this:

  • Weigh hypotheses based on how many steps it takes for them to be computed on a dovetailing of all Turing machines (a rough sketch follows after this list). This would probably put too much weight on programs that are large but fast to compute.

  • Weigh hypotheses by how much space it takes to compute them. So dovetail all Turing machines of size up to n, limited to n bits of space, for at most 2^n steps. This has the nice property that the prior is, like the hypotheses, space limited (using about twice as much space as the hypothesis).

  • Find some quantum algorithm that uses n^k qubits and polynomial time to magically evaluate all programs of length n in some semi-reasonable way. If such a beast exists (which I doubt), it has the nice property that it "considers" all reasonably sized hypotheses yet runs in polynomial space and time.

  • Given n bits of evidence about the universe, consider all programs of length up to k1*log(n), each run for at most n^k2 steps (for constants k1 and k2). This has the nice property that it runs in polynomial time and space.
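A rough sketch of the first bullet, with the details filled in by assumption: programs are modeled as Python generators that yield None while "running" and finally yield their output string, and a program halting on the T-th dovetail step overall gets weight 2^-T (the bullet leaves the exact weighting open).

```python
import itertools

def dovetail_weights(programs, max_rounds):
    """Dovetail: in each round, admit one new program and advance every
    live program by one step. A program that halts on the T-th dovetail
    step overall contributes weight 2^-T to its output (an assumed
    weighting). `programs` is an infinite iterator of generators."""
    live = []
    weights = {}
    steps = 0
    progs = iter(programs)
    for _ in range(max_rounds):
        live.append(next(progs))        # admit one new program per round
        for g in list(live):
            steps += 1
            try:
                out = next(g)
            except StopIteration:       # ill-formed program: no output
                live.remove(g)
                continue
            if out is not None:         # halted with an output string
                weights[out] = weights.get(out, 0.0) + 2.0 ** -steps
                live.remove(g)
    return weights

def toy_program(output, runtime):
    """Stand-in for a real machine: 'computes' for `runtime` steps, then
    halts and outputs `output`."""
    for _ in range(runtime):
        yield None
    yield output

# Fast-halting programs grab nearly all the weight, however large they
# are -- the bias the first bullet worries about.
progs = (toy_program(format(n, "b"), n % 5) for n in itertools.count(1))
print(dovetail_weights(progs, 12))
```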

comment by gjm · 2015-12-11T11:12:32.065Z · LW(p) · GW(p)

I'm not sure this is actually more computable than the Solomonoff prior. Now, instead of having to do a potentially infinite amount of work to find the shortest program that behaves in a particular way, you have to do a definitely infinite amount of work to find all the programs that behave in a particular way.

Could you give an example of an inference problem that's made easier by doing this?

comment by gjm · 2015-12-11T11:10:22.806Z · LW(p) · GW(p)

Clearly, it isn't a probability distribution

I don't think this is a problem. Make your programming language be prefix-free, so that every infinite string of bits begins with exactly one legal program. Then 2^-length(program) is a probability distribution over programs. (This is the same trick that one uses with the Solomonoff prior.)
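Concretely, this is the Kraft inequality: for any prefix-free set of codewords, the weights 2^-length sum to at most 1, with equality for a complete code. A quick check, as a sketch:

```python
from fractions import Fraction

def is_prefix_free(codes):
    """No codeword is a proper prefix of another."""
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

codes = ["0", "10", "110", "111"]  # a complete prefix-free code
assert is_prefix_free(codes)
total = sum(Fraction(1, 2 ** len(c)) for c in codes)
print(total)  # 1: the 2^-length weights form a probability distribution
```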

comment by Manfred · 2015-12-11T17:44:32.511Z · LW(p) · GW(p)

If you want a computable universal prior, forget about Turing machines, because the halting problem will always get you in the end.

One option is to just say that the probability of observing any particular sequence of L bits is equal to 2^-L. This is the maximum entropy prior, which is perfectly easy to compute. It's also useless, because it doesn't have any starting information. It's a prior without any "prior." The Solomonoff prior (which you outlined) is useful at predicting computable sequences, but at the cost of making many overconfident mistakes when the sequence is not computable / infinitely K-complex.

If you want a computable prior that's still useful, one approach might be to think of sequence-generating methods that 1) appear in the real world and 2) don't have the halting problem (maybe look into finite state machines, eh?), and then make a prior that is useful for predicting those sequences, at the cost of making overconfident mistakes when the sequence is of infinite "complexity" in that class.
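Here is a minimal sketch of that last suggestion, with the details assumed rather than taken from the comment: hypotheses are autonomous finite state machines (no input tape, one output bit per state), enumerated up to a small state bound, with 2^-n_states standing in for 2^-description_length. Since such machines always produce a next bit, nothing here can fail to halt.

```python
from itertools import product

def machines(n):
    """All n-state autonomous finite state machines: a next-state for each
    state and an output bit for each state, started in state 0. These just
    generate eventually-periodic bit sequences."""
    for trans in product(range(n), repeat=n):
        for out in product("01", repeat=n):
            yield trans, out

def generate(machine, length):
    """The first `length` bits the machine emits."""
    trans, out = machine
    s, bits = 0, []
    for _ in range(length):
        bits.append(out[s])
        s = trans[s]
    return "".join(bits)

def predict_next(observed, max_states=4):
    """Weighted vote over all machines (up to max_states states) whose
    output starts with `observed`. The weight 2^-n per n-state machine is
    an assumed stand-in for 2^-description_length."""
    votes = {"0": 0.0, "1": 0.0}
    for n in range(1, max_states + 1):
        for m in machines(n):
            if generate(m, len(observed)) == observed:
                votes[generate(m, len(observed) + 1)[-1]] += 2.0 ** -n
    return max(votes, key=votes.get)

print(predict_next("010101"))  # '0': the periodic continuation wins
```

As the comment predicts, this is overconfident exactly when the true sequence has no small finite-state generator.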