Efficient Induction

post by paulfchristiano · 2010-12-27T10:40:38.829Z · LW · GW · Legacy · 25 comments

(By which I mean, induction over efficient hypotheses.)

A standard prior is "uniform" over the class of computable functions (ie, uniform over infinite strings which have a prefix which compiles). Why is this a natural choice of prior? Well, we've looked at the universe for a really long time, and it seems to be computable. The Church-Turing thesis says we have no reason to choose a bigger support for our prior.

Why stop there? There is a natural generalization of the Church-Turing thesis, naturally called the extended Church-Turing thesis, which asserts that the universe is computable in polynomial time. In fact, we have a strong suspicion that physics should be local, which means more or less precisely that updates can be performed using a linear size logical circuit. Maybe we should restrict our prior further, looking only for small circuits.

(As we realized only slightly later, this extended hypothesis is almost certainly false, because in the real world probabilities are quantum. But if we replace "circuit" by "quantum circuit," which is a much less arbitrary change than it seems at face value, then we are good. Are further changes forthcoming? I don't know, but I suspect not.)

So we have two nice questions. First, what does a prior over efficiently computable hypotheses look like? Second, what sort of observation about the world could cause you to modify Solomonoff induction? For that matter, what sort of physical evidence ever convinced us that Solomonoff induction was a good idea in the first place, rather than a broader prior? I suspect that both of these questions have been tackled before, but google has failed me so now I will repeat some observations.

Defining priors over "polynomially large" objects is a little more subtle than usual Solomonoff induction. In some sense we need to penalize a hypothesis both for its description complexity and its computational complexity. Here is a first try:

A hypothesis consists of an initial state of some length N, and a logical circuit of size M which takes N bits to K+N bits. The universe evolves by repeatedly applying the logical circuit to compute both a "next observation" (the first K bits) and the new state of the universe (the last N bits). The probability of a hypothesis drops off exponentially with its length.

How reasonable is this? Why, not at all. The size of our hypothesis is the size of the universe, so it is going to take an awful lot of observations to surmount the improbability of living in a large universe. So what to do? Well, the reason this contradicts intuition is that we expect our physical theory (as well as the initial state of our system) to be uniform in some sense, so that it can hope to be described concisely even if the universe is large. Well luckily for us the notion of uniformity already exists for circuits, and in fact appears to be the correct notion. Instead of working with circuits directly, we specify a program which outputs that circuit (so if the laws of physics are uniform across space, the program just has to be told "tile this simple update rule at every point.") So now it goes like this:

A hypothesis consists of a program which outputs an initial state of some finite length N, and a program which outputs a logical circuit of size M which takes N bits to K+N bits. The observations are defined as before. The probability of a hypothesis drops off exponentially with its length.

How reasonable is this? Well it doesn't exploit the conjectured computational efficiency of the universe at all. There are three measures of complexity, and we are only using one of them. We have the length of the hypothesis, the size of the universe, and the computational complexity of the update rule. At least we now have these quantities in hand, so we can hope to incorporate them intelligently. One solution is to place an explicit bound on the complexity of the update rule in terms of the size of the universe. It is easy to see that this approach is doomed to fail. An alternative approach is to explicitly include terms dependent on all three complexity measures in the prior probability for each hypothesis.

There are some aesthetically pleasing solutions which I find really attractive. For example, make the hypothesis a space bounded Turing machine and also require it to specify the initial state of its R/W tape. More simply but less aesthetically, you could just penalize a hypothesis based on the logarithm of its running time (since this bounds the size of its output), or on log M. I think this scheme gives known physical theories very good description complexity. Overall, it strikes me as an interesting way of thinking about things.

I don't really know how to answer the second question (what sort of observation would cause us to restrict our prior in this way). I don't really know where the description complexity prior comes from either; it feels obvious to me that the universe is computable, just like it feels obvious that the universe is efficiently computable. I don't trust these feelings of obviousness, since they are coming from my brain. I guess the other justification is that we might as well stick to computable hypotheses, because we can't use stronger hypotheses to generate predictions (our conscious thoughts at least appearing to be computable). The same logic does have something to say about efficiency: we can't use inefficient hypotheses to generate predictions, so we might as well toss them out. But this would lead us to just keep all sufficiently efficient hypotheses and throw out the rest, which doesn't work very well (since the efficiency of a hypothesis depends upon the size of the universe it posits; this basically involves putting a cap on the size of the universe). I don't know of a similar justification which tells us to penalize hypotheses based on their complexity. The only heuristic is that it has worked well for screening out physical theories so far. Thats a pretty good thing to have going for you at least.

In sum, thinking about priors over more restricted sets of hypotheses can be interesting. If anyone knows of a source which approaches this problem more carefully, I would be interested to learn.

comment by Manfred · 2010-12-27T20:58:30.689Z · LW(p) · GW(p)

You know more about this than me - is the three-body problem computable by the technical definition?

Replies from: JoshuaZ, paulfchristiano
comment by JoshuaZ · 2010-12-28T01:26:10.603Z · LW(p) · GW(p)

Ok. I really want to know why this question got voted down. It seemed like a perfectly reasonable question for someone to ask if they don't know much about these subjects and have heard that the three body problem is unsolvable.

comment by paulfchristiano · 2010-12-27T21:32:28.836Z · LW(p) · GW(p)

You can numerically integrate the three-body problem. So there is a linear time algorithm to approximately compute what will happen after linear time. There just isn't a logarithmic time algorithm (which is what we would normally mean by "closed form").

Replies from: Manfred
comment by Manfred · 2010-12-28T02:23:49.983Z · LW(p) · GW(p)

Ah, okay. So all that matters is that the right answer is computable given infinite resources. How do you reconcile this with the apparently finite computational resources of the universe?

Replies from: Sniffnoy, paulfchristiano
comment by Sniffnoy · 2010-12-28T02:41:35.832Z · LW(p) · GW(p)

To be clear - the three-body problem deals with real numbers. Computers can't work directly with real numbers, since real numbers can't be finitely represented. Hence when we speak about the computability of a problem "compute f(x)" that calls for a real number answer, we really mean the computability of the problem "compute f(x) to within epsilon" (where now epsilon is another input to the problem).

comment by paulfchristiano · 2010-12-28T02:33:18.928Z · LW(p) · GW(p)

You can compute what will happen efficiently. If you want to know where the 3 bodies are after t seconds to within some constant error, it only takes you about O(t) steps.

Replies from: rwallace, Manfred
comment by rwallace · 2010-12-28T04:26:29.663Z · LW(p) · GW(p)

But errors accumulate over time. That means if you want the error to be bounded by a constant at the end of the calculation, the larger t is, the more you have to do along the way to avoid or correct error. If that just means you have to use higher precision calculations, that would give O(t log t), but if it means you have to use finer time steps, that would make the computational effort O(t^2) or worse.

Replies from: JoshuaZ
comment by JoshuaZ · 2010-12-28T04:54:31.266Z · LW(p) · GW(p)

This is primarily a function of how accurately you can do division and multiplication. Even if it isn't O(t) it is almost probably O(t^(1+epsilon)) for any epsilon>0.

Replies from: rwallace
comment by rwallace · 2010-12-28T11:08:55.494Z · LW(p) · GW(p)

Suppose for the sake of argument you had infinite precision division and multiplication. There would still be finite error due to the use of finite time step size. (Runge-Kutta methods etc. give smaller error for a given step size than the more obvious numerical integration methods, but still not zero.) If you want to reduce the error, you need to use a smaller step size. Generally speaking, the error is a polynomial function of the step size (unlike arithmetic error, which decreases exponentially with the number of digits of precision), so you would expect O(t^(1+epsilon)) to be unattainable. Unless there's some method of reducing error exponentially with step size for the three body problem that I'm missing?

Replies from: paulfchristiano, JoshuaZ
comment by paulfchristiano · 2010-12-28T23:17:58.410Z · LW(p) · GW(p)

Runge-Kutta takes O(delta^(-1/4)) time to get an approximation quality of delta, I think. I don't know if we can yet, but I suspect is is possible to get an approximation quality of delta in time O(delta^(-epsilon)) for any epsilon>0 (in the same sense that I suspect it will eventually possible to multiply two nxn matrices in time O(n^(2 + epsilon)) for any epsilon>0, even though its not practical at all). This would probably imply JoshuaZ's stated time bound. It doesn't require exponentially fast error reduction, just arbitrarily good polynomial error reduction.

Anyway, the model I described in the post doesn't actually have this problem. More precision just comes from using a finer discrete system to approximate the universe (if in fact it is continuous, which I would put a probability of less than 50% on) and still using a linear size circuit to do the simulation. You only pay logarithmically for using a finer grid, in any of the schemes I proposed.

Replies from: rwallace
comment by rwallace · 2010-12-29T01:22:20.602Z · LW(p) · GW(p)

An infinite sequence of algorithms converging on arbitrarily good polynomial error reduction? Fair enough, I certainly can't rule that out at this stage.

But I don't understand your last point: how can you pay only logarithmically for using a finer grid?

Replies from: paulfchristiano
comment by paulfchristiano · 2010-12-29T01:41:54.730Z · LW(p) · GW(p)

The post had a concrete complexity measure, which pays logarithmically for a finer grid (that is, doubling the size of the universe is the same as adding one more bit of complexity). The point is, you can only afford to pay logarithmically in the size of the universe (if you want known physical theories to have good complexity as compared to stupid explanations for our observations). Making the grid twice as fine is just making the universe twice as large, so you only pay 1 more bit: the extra bit needed to describe the larger size of the universe. If you disagree with this then you probably disagree fundamentally with my approach. That is obviously valid; I don't really like my approach that much. But alternative approaches, like the speed prior, seem much worse to me.

Replies from: rwallace
comment by rwallace · 2010-12-29T05:15:09.650Z · LW(p) · GW(p)

Oh, sorry, yes, when your cost measure is complexity, then a finer grid incurs at most a logarithmic penalty, I agree. I also agreed the speed prior is a much worse approach -- I would go so far as to say it is flat-out falsified by the observed extreme computational cost of physics.

comment by JoshuaZ · 2010-12-28T13:27:37.021Z · LW(p) · GW(p)

Hmm, yes you are correct. I was being stupid.

comment by Manfred · 2010-12-28T03:08:28.897Z · LW(p) · GW(p)

Well, it's simple to find a chaotic problem that's not efficient. I was just trying to understand what "the universe is computable" really means since the universe isn't exactly computable.

Replies from: saturn
comment by saturn · 2010-12-28T06:06:43.996Z · LW(p) · GW(p)

It seems like you and some others in this thread are assuming that real numbers describe some actual behavior of the universe, but that's begging the question. If the universe is computable, it implies that all quantities are discrete.

Replies from: rwallace, Manfred
comment by rwallace · 2010-12-28T11:55:47.094Z · LW(p) · GW(p)

Well, if it turns out the universe is continuous, then when we conjecture it to be computable, we typically mean the same thing we mean when we say pi is computable: there exists a fixed length program that could compute it to any desired degree of precision (assuming initial conditions specified to sufficient precision).

comment by Manfred · 2010-12-30T01:38:09.480Z · LW(p) · GW(p)

Continuous quantities are the simplest explanation for the evidence we have - there are some hints that it could be otherwise, but they're only hints.

comment by JoshuaZ · 2010-12-27T15:29:54.193Z · LW(p) · GW(p)

We can sort of see what might cause someone to change their views on what their generic priors should look like by looking at semi-historical examples.

Early on in the theory of computable functions it seemed like all computable functions might be primitive recursive. Presumably, if one didn't know about things llike the Ackermann function, you'd have no issue phrasing a general prior in terms of primitive recursive functions.

Another example is actually in your text, where people realized that quantum systems seemed to be able to do things that non-quantum systems could not (within certain time bounds).

Thus. updating our general framework seems to occur when aspects of the universe around us suggest that modeling them requires a larger class of functions than we anticipated. In the case of primitive recursive functions, it turned out that empirically the human brain could calculate a specific function that wasn't primitive recursive.

For what it is worth, I don't share your confidence that priors won't need to be drastically re-examined. One issue that Eliezer and others have observed with the Solomonoff prior is that it assigns equal probability to different ideas regardless of the computational time. While privileging polynomial time descriptions might help, it isn't clear how one should distinguish between two Turing machines, one which runs in very short time (say degree 2) and another that is long (say degree 20) but the degree 2 has a much smaller number of states. Which one of those is simpler?

Replies from: Kazuo_Thow, paulfchristiano
comment by Kazuo_Thow · 2010-12-27T19:41:14.553Z · LW(p) · GW(p)

On the problem of distinguishing between Turing machines of the kinds you mentioned, does Jürgen Schmidhuber's idea of a speed prior help at all? Searching for "speed prior" here on Less Wrong didn't really turn up any previous discussion.

Replies from: timtyler, JoshuaZ
comment by timtyler · 2010-12-31T15:31:56.118Z · LW(p) · GW(p)

I discuss that concept here: http://alife.co.uk/essays/the_one_true_razor/

comment by JoshuaZ · 2010-12-27T19:52:50.950Z · LW(p) · GW(p)

Hmm, I had not seen the speed prior before. It seems to make strong testable predictions about how the universe functions. I'll have to look into it.

comment by paulfchristiano · 2010-12-27T19:45:43.688Z · LW(p) · GW(p)

While privileging polynomial time descriptions might help, it isn't clear how one should distinguish between two Turing machines, one which runs in very short time (say degree 2) and another that is long (say degree 20) but the degree 2 has a much smaller number of states. Which one of those is simpler?

I have given one answer in the post, selected somewhat arbitrarily (but with the desirable property of working more or less correctly for physical theories we have seen so far). I think basing things on circuits rather than TM's is clearly a better start.

For one thing, I don't know what "degree 2" means for you. Is that in terms of the size of the universe? The # of the current observation? Both of these have significant problems (the first fails if the universe gets padded with junk, the second fails if your observations don't begin at the beginning of time).

For a second thing, Turing machines are a horrible way to measure the degree of polynomial run-times. There is basically no reason to believe that you get out a reasonable thing, unlike with circuits. Of course, using circuits introduces other issues, which require new solutions. I suggest one.

Of course, the post just contains my best try. I would be happy to see a better go at it, but I would be very surprised if it was based on any inherently uniform model of computation which is known today.

Replies from: JoshuaZ
comment by JoshuaZ · 2010-12-27T19:53:28.586Z · LW(p) · GW(p)

I was thinking in terms of number of observations. I agree that this has problems.

comment by taw · 2010-12-31T06:29:03.396Z · LW(p) · GW(p)

You're using phrases like "universe is computable in polynomial time", and I understand your intuition, but such phrases don't really mean anything.