# Open Problems Related to Solomonoff Induction

post by Wei_Dai · 2012-06-06T00:26:10.035Z · score: 27 (28 votes) · LW · GW · Legacy · 106 comments## Contents

Apparent Unformalizability of “Actual” Induction Is Solomonoff Induction “good enough”? Is complexity objective? Is Solomonoff an ideal or an approximation? How can we apply Solomonoff when our inputs are not symbol strings? What does Solomonoff Induction actually say? None 103 comments

Solomonoff Induction seems clearly "on the right track", but there are a number of problems with it that I've been puzzling over for several years and have not made much progress on. I think I've talked about all of them in various comments in the past, but never collected them in one place.

#### Apparent Unformalizability of “Actual” Induction

##### Argument via Tarski’s Indefinability of Truth

Informally, the theorem states that arithmetical truth cannot be defined in arithmetic. The theorem applies more generally to any sufficiently strong formal system, showing that truth in the standard model of the system cannot be defined within the system.

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

##### Argument via Berry’s Paradox

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

#### Is Solomonoff Induction “good enough”?

Given the above, is Solomonoff Induction nevertheless “good enough” for practical purposes? In other words, would an AI programmed to approximate Solomonoff Induction do as well as any other possible agent we might build, even though it wouldn’t have what we’d consider correct beliefs?

#### Is complexity objective?

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

#### Is Solomonoff an ideal or an approximation?

Is it the case that the universal prior (or some suitable generalization of it that somehow overcomes the above "unformalizability problems") is the “true” prior and that Solomonoff Induction represents idealized reasoning, or does Solomonoff just “work well enough” (in some sense) at approximating any rational agent?

#### How can we apply Solomonoff when our inputs are not symbol strings?

Solomonoff Induction is defined over symbol strings (for example bit strings) but our perceptions are made of “qualia” instead of symbols. How is Solomonoff Induction supposed to work for us?

#### What does Solomonoff Induction actually say?

What does Solomonoff Induction actually say about, for example, whether we live in a creatorless universe that runs on physics? Or the Simulation Argument?

## 106 comments

Comments sorted by top scores.

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

Solomonoff Induction would not consider this a description of x as it cannot be used to compute x.

I guess I failed to bridge the inferential gap here. You're right "Solomonoff Induction would not consider this a description of x as it cannot be used to compute x." The point I'm trying to make is that a correct formalization of induction/Occam's Razor ought (in order to satisfy our intuitions) to assign x some probability > 1/3^^^3 since it has a short (even if uncomputable) description, but a formalization based on this P would not, therefore it can't be a correct formalization. But this is the case no matter what P we use (with different x depending on P), hence the "unformalizability" problem.

Could you explain further? Why "ought" it assign it such a probability? As stated, this seems more convincing as an argument that it "ought not" to assign a probability > 1/3^^^3, despite the short "description".

The idea is, however we define P, how can we be *that* sure that there isn't some kind of uncomputable physics that would allow someone to build a device that can find the lexicographically least object x such that P(x) < 1/3^^^3 and present it us?

Maybe we should just work off of the assumption that there's no relevant uncomputable physics, because if there were, we should probably give up our endeavors anyways, unless we knew how to model an uncomputable reality within a computable AGI-program. As Schmidhuber ever so aptly wrote on his homepage:

The distribution should at least be computable in the limit. That is, there should exist a program that takes as an input any beginning of the universe history as well as a next possible event, and produces an output converging on the conditional probability of the event. If there were no such program we could not even formally specify our universe, leave alone writing reasonable scientific papers about it.

unless we knew how to model an uncomputable reality within a computable AGI-program

You could start out by trying to understand how an AI might invent the *concept* of uncomputability, and how it might then proceed to the possibility of uncomputable physics. And one way to get started here is by thinking as a cognitive historian, and asking how *humans* came up with the concept of uncomputability.

"Give up" is neither a well-defined strategy, nor one that was shown to be preferable to its alternatives.

That gives you a probability inversely proportional to the integral of e^-(the description length) of each number from 0 to infinity. Complexity grows like log(n)-ish. It's an improper prior.

So I agree with Adele that this may be a modus tollens / modus ponens moment.

If there's some uncomputable physics that would allow someone to build such a device, we ought to redefine what we mean by computable to include whatever the device outputs. After all, said device falsifies the Church-Turing thesis, which forms the basis for our definition of "computable".

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

It seems to me that the paradox may lie within this problem setup, not within the agent doing the induction.

We first consider that, rather than this device being assigned zero probability, it should actually be *inconceivable* to the agent - there should not be a finitely describable thingy that the agent assigns zero probability of having a finitely describable property.

Why would an agent using a second-order analogue of Solomonoff induction have such conceptual problems? Well, considering how Tarski's original undefinability theorems worked, perhaps what goes wrong is this: we want to believe that the device outputs the truth of statements about the universe. But we also want to believe this device is *in* the universe. So what happens if we ask the device, "Does the universe entail the sentence stating that outputs 'No' in response to a question which looks like ?"

Thus, such a device is inconceivable in the first place since it has no consistent model, and we are actually correct to assign zero probability to the alien's assertion that the device produces correct questions to all questions about the universe including questions about the device itself.

we want to believe that the device outputs the truth of statements about the universe

I'm not sure why you say this, because the device is supposed to output the truth of statements about some second-order logic, not about the universe. The device is not describable by the second-order logic via Tarski, so if the device is in the universe, the universe must only be describable by some meta-logic, which implies the device is outputting truth of statements about something strictly simpler than the universe. The agent ought to be able to conceive of this...

In that case I'm not sure what you mean by "second-order analogue of Solomonoff induction". Solomonoff induction is about the universe and proceeds from sensory experiences. What does it mean to induct that something may produce the truth of sentences "about a second-order logic"?

What does it mean to induct that something may produce the truth of sentences "about a second-order logic"?

I mean suppose you encounter a device that outputs "true" or "false" whenever you feed it a sentence in some second-order logic, and after doing a lot of tests, you think maybe the device outputs "true" if and only if the input sentence is actually true, regardless of what input it receives. This seems like a straightforward analogue of inducting that something may be a Halting Oracle, which I thought you agreed an agent should be able to do?

I'm not sure if this answers your question. If not, feel free to find me on Skype or Google Chat to talk more.

By "true" do you mean that the physical universe semantically entails the sentence, or that the sentence is a semantic tautology of second-order logic? I'm assuming the latter, since the former case is what I was assuming when I gave my Tarski reply.

So... I'm pretty sure you can represent a set's semantic entailment of a Henkin interpretation of a second-order sentence in a first-order set theory. I'm less sure that you can represent entailment of a second-order sentence inside a second-order set theory, but I'm having trouble seeing why you *wouldn't* be able to do that. I also think that second-order set theories are supposed to be finitely axiomatizable, though I could be wrong about this. But then I'm not *quite* sure why we don't get into Tarskian trouble.

Let ST be the finite axiomatization of a second-order set theory. Then in second-order logic, why doesn't the sentence (ST->"All sets: Set entails S") form a truth predicate for S? My guess is that there's just no theorem which says, "X is *true* (entailed by the mathematical universal set / model we're currently operating inside) iff 'X' is entailed by all sets (inside the current model)".

If this is so, then it looks to me like for any given set of axioms corresponding to a second-order set theory, second-order logic can represent the idea of a device that outputs sentences which are semantically entailed by every individual set *inside* that theory. So the new answer would be that you are welcome to hypothesize that a device prints out truths of second-order logic, for any given background second-order set theory which provides a universe of models against which those sentences are judged universally semantically true.

In which case the indefinite extensibility gets packed into the choice of set theory that we think this device is judging second-order validity for, if I haven't dropped the bucket somewhere along the way.

Then in second-order logic, why doesn't the sentence (ST->"All sets: Set entails S") form a truth predicate for S?

Let S="ST -> false". Then S is false in second-order logic (assuming ST is consistent), but (ST->"All sets: Set entails S") is true, because ST's model has to be bigger than any set (I think it's usually taken to be the proper class of all sets), so every set entails "Not ST".

So the new answer would be that you are welcome to hypothesize that a device prints out truths of second-order logic, for any given background second-order set theory which provides a universe of models against which those sentences are judged universally semantically true.

On input S="ST -> false", your device prints out "true", while my device prints out "false". I still want to be able to hypothesize my device. :)

There are totally models of ZFC containing sets that are models of ZFC. See "Grothendieck universe". Is there a reason why it'd be different in second-order logic? I don't think a second-order set theory would pin down a *unique* model, why would it? Unless you had some axiom stating that there were no more ordinals past a certain point in which case you might be able to get a unique model. Unless I'm getting this all completely wrong, since I'm overrunning my expertise here.

So in retrospect I have to modify this for us to somehow suppose that the device is operating in a particular model of a second-order theory. And then my device prints out "true" (if it's in one of the smallest models) or the device prints out "false" (if it's in a larger model), unless the device is against the background of an ST with an upper bound imposing a unique model, in which case the device does print out "true" for ST -> false and from the outside, we think that this device is about a small collection of sets so this result is not surprising.

Then the question is whether it makes sense to imagine that the device is about the "largest relevant" model of a set theory - i.e., for any other similar devices, you think no other device will ever refer to a larger model than the current one, nor will any set theory successfully *force* a model larger than the current one - I think that's the point at which things get semantically interesting again.

Is there a reason why it'd be different in second-order logic?

Second-order set theory is beyond my expertise too, but I'm going by this paper, which on page 8 says:

We have managed to give a formal semantics for the second-order language of set theory without expanding our ontology to include classes that are not sets. The obvious alternative is to invoke the existence of proper classes. One can then tinker with the definition of a standard model so as to allow for a model with the (proper) class of all sets as its domain and the class of all ordered-pairs x, y (for x an element of y) as its interpretation function.12 The existence of such a model is in fact all it takes to render the truth of a sentence of the language of set theory an immediate consequence of its validity.

So I was taking the "obvious alternative" of proper class of all sets to be the standard model for second order set theory. I don't quite understand the paper's own proposed model, but I don't think it's a set either.

I'm not sure I believe in proper classes and in particular, I'm not sure there's a proper class of all sets that could be the model of a second-order theory such that you could not describe any set larger than the model, and as for pinning down that model using axioms I'm pretty sure you shouldn't be able to do that. There are analogues of the Lowenheim-Skolem theorem for sufficiently large infinities in second-order logic, I seem to recall reading.

**[deleted]**· 2012-06-06T17:56:30.874Z · score: 5 (5 votes) · LW(p) · GW(p)

Is complexity objective?

This one bother me as well. Turing machine basis for complexity favours certain types of theories over others. For example, how many bits of turing machine does it take to predict, say, maxwells equations, which involve *continuous fields*?

(warning: original "research" following)

One temptation is to sweep this under the rug by saying that the complexity required to talk about continuous math is just a constant-sized interpreter that you put on the turing machine, but this can't get around the fact that some thoeries *won't need that*.

(BTW, is continuous math even exactly turing-computable?)

If we try to generalize and say "turing machines suck, we'll use some other language", it becomes clear that all langauges will have some bias. English has a bias for agenty mind-stuff, turing machines have a bias for discrete cellular automata, some imaginable basis language will have a bias for simple continuous mathematics.

I think the basis that is closest to correct would be one based on a hypothetical real-number continuous-memory hypercomputer (for obvious reasons). This of course implies that I have learned this thing, which implies some deeper form of induction over forms of induction, that would make a language that had only one operator that was just "physics" take appropriate penalization. And then thinking about that idea reveals that inducting over langauges and inducting over theories in languages is suspiciously similar.

That tempts me to say that they should not be seperate, but then I still think the general utility of continuous math should not have to be proven for every theory that uses it, which implies some sharing of complexity somehow. But that seems like nonsense, SI already handles that, I think (by each thoery being a completely defactored package deal). Now I think it would be good to find a new SI that can handle factoring (for algorithmic reasons), but maybe that is a step down the "how do we actually do induction" road, which is not what SI is about.

Thinking over this all again, it seems starting with any form of induction that sortof works will eventually come to the right answer. The hypothetical perfect form of induction that would be the best a learning machine could start with is just the actual answer. Again there's that duality (is that the right word) between theories and induction priors.

This looks to me alot like some kind of iterative improvement algorithm, where you have to start somewhere, and where you start is chosen from the same domain as your answer (think newton's method, or gradient descent). It seems like even starting with some human language (like english), you would eventually learn (as we have done) that building a math-interpreter is the thing to do.

Ok, that is the end of my confusion dump.

**My unproven, badly specified idea that has something to do with induction:**

Ok, background: All turing-complete languages can build interpreters for all others. This implies some traversable languagespace (which is a subset of programspace) where the distance between two points is the length of the interpreter to get from one language to the other. We will also want to be able to go backwards, so that we can go from any point in programspace to any other (by backing up to a good language, and then going forward to new program). Going backwards means building this point from some point in languagespace, rather than building some point in programspace from this point in languagespace) Points in programspace are defined over behavior, so the same program will be reachable from multiple languages.

(the lisp-like area gets a lot of traffic thru it. ("any sufficiently complex program contains ... half of common lisp"))

Ok, so now we have a programspace with distances measured in bits. We could just put a probability distribution over the subset of programspace that makes predictions. Solomonof induction does this, with initial improbability defined by distance from turing-machine-langauge.

I think there might be some way to use this programspace idea for induction other than just being another way to look at SI programs. I'm imagining probability or something flowing around in this space updating our concept of simplicity so that when we see that newtonian mechanics, GR, and quantum mechanics are all closely reachable from some mathy point in languagespace, we then upgrade the probability of math-reachable thoeries, without even talking about what they predict. I'm also imagining if we build an AI to do induction, we say "here are some interesting points in programspace (newton, maxwell, SR, GR, quantum) that have served us well" and letting it figure out a distribution over programspace instead of "start from turing-machine-language".

I have more ideas and I fear I am not communicating this very well.

I think this interpreter-traversable programspace is a good concept to investigate for trying to make SI more objective. Even if we just stick with something like SI, it seems like an interesting concept.

yeah...

The thing that you're describing is a lot like finding the stationary distribution of a Markov chain. Not all Markov chains have stationary distributions and I can't tell whether this one would, though if it does have one it would be unique. It's an interesting idea.

I should also note that it is not necessary to do anything like this to preform induction over multiple theories. Instead, we can just require a program that outputs known theories in addition to the new theory. We can ask them to be output in any form, such as predictions or source code, since converting source code to predictions is O(1) in program size. Then, if the different theories have more in common, the program computing them all will be shorter than one that must seperately specify elements of very different theories. A slightly different implementation of the same general principle is Schmidhuber's Optimal Ordered Problem Solver.

Some continuous math is commutable, some is not. There are uncomputable real numbers. Computable number.

**[deleted]**· 2012-06-06T02:57:19.864Z · score: 3 (3 votes) · LW(p) · GW(p)

This is the second mention of second-order logical Solomonoff Induction, but I can't imagine what such a thing would look like.

It's Luke and Eliezer's term, but I guess the idea is similar to the one I had. You take some second-order theory, and let each string in the formal language that constitutes a description of X contribute n^-l to the probability of X, where n is the size of the alphabet, and l is the length of the string. Use the resulting distribution in place of the universal prior in Solomonoff Induction.

Let's say the theory is ZFC. Does the string "0 if the continuum hypothesis if true, otherwise 1" contribute to the probability of 0, or the probability of 1?

(As you most likely know, in the standard interpretation of second order ZFC the continuum hypothesis has a definite truth value, but it can't be proved or disproved in any of the deductive systems that have been proposed for second order ZFC.) My suggestion is that the string "0 if the continuum hypothesis if true, otherwise 1" contribute to the probability of 0 if the continuum hypothesis is true, and the probability of 1 if the continuum hypothesis is false. The problem that we don't know how to deal with a probability distribution like this seems similar to the problem of not knowing how to use the universal prior in the original Solomonoff Induction, both of which seem to be instances of logical/mathematical uncertainty, and there might be a general solution to that problem.

The reason I think this is plausible is that for example P=NP may not be provable or disprovable in any formal deductive system that have been proposed, and yet any agent would still have to make decisions that are affected by its truth value, such as whether to search for a polynomial solution to 3-SAT. If there is some sort of principled way to make those decisions, perhaps the same methods can be used to make decisions involving the continuum hypothesis?

All universal Turing machines can simulate each other with logarithmic slowdown. Saying that the parameter means that complexity "subjective" is like saying the time complexity of Quicksort is "subjective" because the algorithm doesn't specify which programming language to implement it in.

For aliens with a halting oracle:

Suppose the aliens have this machine that may or may not be a halting oracle. We give them a few Turing machine programs and they decide which ones halt and which ones don't. Then we run the programs. Sure enough, none of the ones they say run forever halt, and some of them they say don't run forever will halt at some point. Suppose we repeat this process a few times with different programs.

Now what method should we use to predict the point at which new programs halt? The best strategy seems to be to ask the aliens which ones halt, give the non-halting ones a 0 probability of halting at every step, and give the other ones some small nonzero probability of halting on every step. Of course this strategy only works if the aliens actually have a halting oracle so it its predictions should be linearly combined with a fallback strategy.

I think that Solomonoff induction will find this strategy, because the hypothesis that the aliens have a true halting oracle is formalizable. Here's how: we learn a function from aliens' answers -> distribution over when the programs halt. We can use the strategy of predicting that the ones that the aliens say don't halt don't halt, and using some fallback mechanism for predicting the ones that do halt. This strategy is computable so Solomonoff induction will find it.

For Berry's paradox:

This is a problem with every formalizable probability distribution. You can always define a sequence that says: see what bit the predictor predicts with a higher probability, then output the opposite. Luckily Solomonoff induction does the best it can by having the estimates converge to 50%. I don't see what a useful solution to this problem would even look like; it seems best to just treat the uncomputable sequence as subjectively random.

If I understand Solomonoff Induction correctly, for all n and p the the sum of the probabilities of all the hypotheses of length n equals the sum of the probabilities of hypotheses of length p. If this is the case, what normalization constant could you possibility use to make all the probabilities sum to one? It seems there are none.

Use a prefix-free encoding for the hypotheses. There's *not* 2^n hypotheses of length n: Some of the length-n bitstrings are incomplete and you'd need to add more bits in order to get a hypothesis; others are actually a length <n hypothesis plus some gibberish on the end.

Then the sum of the probabilities of all programs of all lengths combined is 1.0. After excluding the programs that don't halt, the normalization constant is Chaitin's Omega.

Unfortunately Chaitin's Omega's incomputable, but even if it wasn't I don't see how it would work as a normalizing constant. Chaitin's Omega is a real number, there is an infinite number of hypotheses, and (IIRC) there is no real number r such that r multiplied by infinite equals one, so I don't see how Chaitin's Omega could possible work as a normalizing constant.

If we assign a value of 2^-n to each hypothesis (that is, each halting program) of length n, then the total value summed over all hypotheses is Chaitin's Omega.

Thus, if we assign a value of 2^-n/Omega to each hypothesis, the total is 1, and this gives us a valid probability distribution.

That would assign a zero probability of hypotheses that take more than n bits to specify, would it not? That sounds far from optimal.

Sorry, I mean that we do this for each value of n. So, given a hypothesis H, it is assigned a probability of 2^-length(H)/Omega.

I still don't see how that would make all the hypotheses sum to 1. Wouldn't that only make the probabilities of all the hypotheses of length n sum to 1, and thus make the sum of all hypotheses sum to > 1? For example, consider all the hypotheses of length 1. Assuming Omega = 1 for simplicity, there are 2 hypotheses, each with a probability of 2^-1/1 = 0.5, making all them sum to 1. There are 4 hypotheses of length 2, each with a probability of 2^-2/1 = 0.25, making them sum to 1. Thus, the sum of the probabilities of all hypotheses of length <= 2 = 2 > 1.

Is Omega doing something I don't understand? Would all hypotheses be required to be some set length?

The hypotheses are encoded in a prefix-free way: no hypothesis is a prefix of another hypothesis.

In particular, this eliminates your example: if both strings of length 1 ("0" and "1") are valid hypotheses, then no string of longer length can be a valid hypothesis (then we can take Omega=1 and assign a probability of 1/2 to each). Or we could have "0", "10", and "110" be valid hypotheses; then we can take Omega = 7/8 and assign probabilities of 4/7, 2/7 and 1/7 to these.

In general, the prefix-free condition ensures that the sum over all hypotheses converges, by Kraft's inequality, which is the real heavy lifter here; the constant is merely there to make the sum over all hypotheses converge to 1 instead.

You might imagine that hypotheses are required to be grammatically correct English sentences (I guess encoded in ASCII or something). In that case, no hypothesis is a prefix of another, because each hypothesis ends with a period.

Right; I forgot that it used a prefix-free encoding. Apologies if the answer to this is painfully obvious, but does having a prefix-free encoding entail that there is a finite number of possible hypotheses?

It doesn't: for example, the set of strings 0, 10, 110, 1110, 11110, 111110, ... is infinite and prefix-free. So is the set of C-style (null-terminated) strings.

I see. Does the method of normalization you gave work even when there is an infinite number of hypotheses?

It does. The infinite sum will always converge, so by normalizing we can make it converge to 1.

Take gjm's approach to explaining the normalization, in which the initial weight of 2^-length assigned to a hypothesis is the probability that you obtain that hypothesis by choosing bits at random. Then the normalization is just a conditional probability: you divide by the probability that, when choosing bits at random, you do eventually end up hitting a hypothesis.

Remark: these are special cases of the schema "all strings that contain exactly one instance of pattern X in possible positions Y", where in the first case X is "0" and Y is "anywhere" and in the second X is "00000000" and Y is "any position that's a multiple of 8 bits".

(Of course there are plenty of other prefix-free sets of bitstrings. For instance: interpret blocks of 8 bits as characters as in Kindly's second example, and say that a valid program is anything that begins with a "(", has properly matched parentheses, and ends with the ")" matching the first "(". This doesn't have the convenient property that every bitstring begins with a valid program; for examples that do, take the exponential Golomb coding or the Elias omega coding of natural numbers.

I think you may be missing two things.

First: the hypotheses/programs (recall: we are representing hypotheses by computer programs in some suitable language) need to be written in a "prefix-free" representation, meaning one in which e.g. if 100111 is a valid program then nothing else beginning 100111 can be. So you *don't* get two programs of length 1, four of length 2, etc. If you do this, and if you also arrange your coding so that every infinite string of bits starts with *some* legal program, then the sum over all legal programs of 2^-length is exactly 1. (So Pr(program) = 2^-length(program) defines a probability distribution over programs. It's the same probability distribution as you get from the following procedure: keep generating random bits until the bits you've generated form a legal program.)

Second: you can't just take Ω=1 "for simplicity"; Ω is defined to be what you get by adding up 2^-length for all programs that halt. (Remember that hypotheses are being represented as computer programs here.) Equivalently, if you imagine picking a random well-formed program by choosing random bits, Ω is the probability that the resulting program halts. So it's certainly <1.

[EDITED to fix an ugly but hopefully not confusing typo.]

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

The description of x has to include the description of P, and that has to be computable if a universal distribution is going to assign positive probability to x.

If P has a short computable description, then yes, you can conclude that P is not a universal distribution. Universal distributions are not computable.

If the shortest computable description of P is long, then you can't conclude from this argument that P is not a universal distribution, but I suspect that it still can't be a universal distribution, since P is computable.

If there is no computable description of P, then we don't know that there is a computable description of x, so you have no contradiction to start with.

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability

Actually, what Tarski seems to show is that for *any* language for describing *any* set of universes, there just *is* no language representable inside those universes for *representing* arbitrary statements, with truth values, about "everything" including the language and the statements in it. If you try to invent such a language, it will end up inconsistent - not at the point where it tries to correctly assign truth, but at the point where it can *represent* truth, due to analogues of "This statement is false." It isn't needful to assign 0 or 1, in particular, to this statement; the moment you represent it, you can prove an inconsistency. But is this really proper to blame on Solomonoff induction?

But is this really proper to blame on Solomonoff induction?

I would like to have a method of induction that for any formal language, assigns a non-zero prior to the existence of a device that can enumerate or decide all true sentences in that language, or alternatively an explanation based on reasonable principles for which such devices should have zero probabilities. Right now we do not have either, and your research program for improving SI (i.e., to base it on second-order logic) will not give us either even if it's successful. So while I'm not sure it makes sense to say I "blame" Solomonoff induction (what could that mean?), you could say that I'm not satisfied with either the status quo or any improvements to it that we can currently foresee.

Give me a set of formal languages over which you can say the phrase "for any formal language", and the truth predicate for the union of the set won't be in any language in the set. I'm still trying to understand how to deal with this inside AI, but I'm not sure that blaming it on second-order logical induction is putting the blame in the right place.

Again, I'm not sure what you mean by "blame" here. If you're saying that Tarski's result represents a problem that affects more than just attempts to generalize Solomonoff induction, then I agree.

BTW, while I have your attention, what's your evaluation of Paul Christiano's FAI design idea, which sort of tries to punt as many philosophical problems as possible (including this one)? I noticed that you didn't comment in that discussion.

A kilobit of improbability requires only a kilobit of data to offset it, which isn't very crackpot at all. Proving minimum length is impossible, but proving an upper bound on length is very easy, and that proves a lower bound on probability.

I take your point re: length vs speed. The theorem that I think justifies calling Kolmogorov Complexity objective is this:

"If K1 and K2 are the complexity functions relative to description languages L1 and L2, then there is a constant c (which depends only on the languages L1 and L2) such that |K1(s) - K2(s)| <= c for all strings s."

(To see why this is true, note that you can write a compiler for L2 in L1 and vice versa)

I don't see why code modelling symmetric laws should be longer than code modelling asymmetric laws (I'd expect the reverse; more symmetries means more ways to compress.) Nor why 3 spatial dimensions (or 10 if you ask string theorists) is the minimum number of spatial dimensions compatible with intelligent life.

The whole point of Solomonoff induction is that the priors of a theory are not arbitrary. They are determined by the complexity of the theory, then you use Bayes rule on all theories to do induction.

How can we apply Solomonoff when our inputs are not symbol strings?

Inputs can always be represented by bit sequences. Qualia are an irrelevance. Q.E.D.

Doesn't the "irrelevance" depend on an assumption about how our minds work? If we have some analog computation going on, the bit sequences might be a mere approximation to our actual inductive reasoning. Of course, once we put our conclusions into language, those conclusions can be digitized - but that might happen at a relatively late stage of the reasoning game.

Analog computation is another irrelevance - analog systems can always be approximated arbitrarily closely by discrete systems. Witness the ongoing digial revolution.

Sure, if you are designing a system from the ground up. I've never needed more than 12 bits of precision for any real engineering purpose, and I wouldn't pay a single dollar for more. But Wei_Dai's question was about *us* - or at least reads literally that way. Maybe I took it too literally.

There's no credible evidence that we aren't digital. E.g. see digital physics. If nobody can tell the difference, the issue is probably in the realm of vacuous philosophy..

Here's an idea for a modification of Solomonoff induction that no longer has a subjectively-chosen machine to encode hypotheses in. One could instead simply consider how many bits it would take to encode a solution on all universal Turing machines, and making each hypotheses’ prior equal to the average of its prior on each machine. This makes somewhat intuitive sense, as one doesn’t know which “machine” the universe in “programmed” on, so it’s best to just assume a random machine. Unless I’m mistaken, there’s an infinite number of universal Turing machines, but I think algorithms could still approximate the induction.

Is there any justification that Solomonoff Induction is accurate, other than intuition?

of course not

"Of course" implies that the answer is obvious. Why is it obvious?

Dunno what Username was thinking, but here's the answer I had in mind: "Why is it obvious? Because the Problem of Induction has not yet been solved."

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

I'm not sure I understand you correctly, but there are two immediate problems with this:

- If the goal is to figure out how useful Solomonoff induction is, then "a generalized version of Solomonoff Induction based on some second-order logic" is not relevant. We don't need random generalizations of Solomonoff induction to work in order to decide whether Solomonoff induction works. I think this is repairable, see below.
- Whether the alien has a device that does such-and-such is not a property of the world, so Solomonoff induction does not assign a probability to it. At any given time, all you have observed is the behavior of the device for some finite past, and perhaps what the inside of the device looks like, if you get to see. Any finite amount of past observations will be assigned positive probability by the universal prior so there is never a moment when you encounter a contradiction.

If I understand your issue right, you can explore the same issue using stock Solomonoff induction: What happens if an alien shows up with a device that produces some uncomputable result? The prior probability of the present situation will become progressively smaller as you make more observations and asymptotically approach zero. If we assume quantum mechanics really is nondeterministic, that will be the normal case anyway, so nothing special is happening here.

What does Solomonoff Induction actually say?

I believe this one has been closed ages ago by Alan Turing, and practically demonstrated for approximations by the investigation into busy beaver function for example. We won't be able to know BB(10) from God almighty. Ever.

I'm not sure what you're trying to say here, but if you consider this a relative weakness of Solomonoff Induction, then I think you're looking at it the wrong way. We will know it as well as we possibly could given the evidence available. Humans are subject to the constraints that Solomonoff Induction is subject to, and more.

Whoops, thread necromancy.

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity.

I'm very confused about this myself, having just read this introductory paper on the subject.

My understanding is that an "ideal" reference UTM would be a universal turing machine with no bias towards any arbitrary string, but rigorously defining such a machine is an open problem. Based on our observation of UTMs, the more arbitrary simplifications a Turing Machine makes, the longer its compiler will have to be on other UTMs. This is called the Short Compiler Assumption. Since we can't agree on a single ideal reference UTM, we instead approximate it by limiting ourselves to a class of "natural" UTMs which are mutually interpretable within a constant. The smaller the constant, the less arbitrary simplification the UTMs in the class will tend to make.

This seems to mesh with the sequences post on Occam's Razor:

What if you don't like Turing machines? Then there's only a constant complexity penalty to design your own Universal Turing Machine that interprets whatever code you give it in whatever programming language you like.

What I'm confused about is this constant penalty. Is it just "some constant" or is it knowable and small? Is there a specific UTM for which we can always write a short compiler on any other UTM?

I'm getting out of my league here, but I'd guess that there is an upper bound on how complex you can make certain instructions across all UTMs because UTMs must (a) be finite, and (b) at the lowest level implement a minimal set of instructions, including a functionally full set of logical connectives. So for example, say I take as my "nonbiased" UTM a UTM that aside from the elementary operations of the machine on its tape, jump instructions, etc. has only a minimal number of instructions implementing a minimally complete operator set with less than two connectives: {NAND} or {NOR}. My understanding is that anything that's a Universal Turing Machine is going to have to itself have a small number of instructions that implement the basic machine instructions and a complete set of connectives somewhere in its instruction set, and converting between {NAND} or {NOR} and any other complete set of connectives can be done with a trivially short encoding.

Since we can't agree on a single ideal reference UTM, we instead approximate it by limiting ourselves to a class of "natural" UTMs which are mutually interpretable within a constant.

Even if you do that, you're left with an infinite number of cliques such that within any given clique the languages can write short interpreters for each other. Picking one of the cliques is just as arbitrary as picking a single language to begin with. i.e. for any given class of what-we-intuitively-think-of-as-complicated programs X, you can design a language that can concisely represent members of X and can concisely represent interpreters with this special case.

What I'm confused about is this constant penalty. Is it just "some constant" or is it knowable and small?

It's a function of the language you're writing an interpreter of and the language you're writing it in. "Constant" in that it doesn't depend on the programs you're going to run in the new language. i.e. for any given pair of languages there's a finite amount of disagreement between those two versions of the Solomonoff prior; but for any given number there are pairs of languages that disagree by more than that.

Is there a specific UTM for which we can always write a short compiler on any other UTM?

No. There's no law against having a gigabyte-long opcode for NAND, and using all the shorter opcodes for things-we-intuitively-think-of-as-complicated.

So is there then a pragmatic assumption that can be made? Maybe we assume that if I pick a turing machine blindly, without specifically designing it for a particular output string, it's unlikely to be strongly biased towards that string.

What probability distribution over turing machines do you blindly pick it from? That's another instance of the same problem.

Pragmatically, if I non-blindly pick some representation of turing machines that looks simple to me (e.g. the one Turing used), I don't really doubt that it's within a few thousand bits of the "right" version of solomonoff, whatever that means.

Pragmatically, if I non-blindly pick some representation of turing machines that looks simple to me (e.g. the one Turing used), I don't really doubt that it's within a few thousand bits of the "right" version of solomonoff, whatever that means.

Why not?

Apparent Unformalizability of “Actual” Induction

Re: Tarski’s Indefinability of Truth - I don't get this one. Solomonoff Induction estimates probabilities of bitstring sequences - but it's uncomputable - and the best we can do is finite approximations - which of course have limitations and will be able to be improved upon - just as Tarski’s "Indefinability of Truth" says.

Re: Argument via Berry’s Paradox - but the description of x includes P - which could be complex. So the description of x is not short - if you are not given P.

Is complexity objective?

Kinda. Some notions of complexity are more useful than others. However, condition your machine on some real world sense data for a while, and *any* reasonable prior soon gets swamped by data - so in practice, this is not a big deal.

I'm not sure what you have in mind by basing Solomonoff induction on a second-order logic. Can you give a reference?

Here is what I'd suggest as a baseline. Use John Tromp's Binary Lambda Calculus with some variation of Levin Search to explore time/complexity bounded programs that output some initial sequence of data.

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

Perhaps we could formalize some argument that for an arbitrarily poor definition of complexity, the error becomes negligible for the average sufficiently complex universe?

Solomonoff Induction is defined over symbol strings (for example bit strings) but our perceptions are made of “qualia” instead of symbols. How is Solomonoff Induction supposed to work for us?

Our perceptions have some neurological representation.

The interesting bit is that if you could actually use Solomonoff induction as prior, I'm pretty sure you'd be an irreparable aether believer and 'relativity sceptic', with confidence that has so many nines nothing would *ever* convince you. That's because the absolute-speed irreversible aether universes like 3d versions of Conway's game of life are so much more computationally compact than highly symmetric universe with many space-time symmetries (to the point of time and space intermixing in the equations) and the fully reversible fundamental laws of physics. Saying this to you as applied mathematician with experience actually implementing laws of physics in software.

The priors are only a free parameter if all you care for is not getting Dutch-booked on your bets about physics but do not give a damn to have any useful theories.

That's because the absolute-speed irreversible aether universes like 3d versions of Conway's game of life are so much more computationally compact than highly symmetric universe with many space-time symmetries (to the point of time and space intermixing in the equations) and the fully reversible fundamental laws of physics.

Can you explain why this is the case? Also, when you say "aether" universes are more computationally compact than "relativity" universes, is this before or after taking into account our observations (i.e., are you restricting your attention to universes that fit our observations, or not)?

Saying this to you as applied mathematician with experience actually implementing laws of physics in software.

Is it possible that what you said is true only if we want the laws of physics to run fast on current computers? I'm afraid that most software people, including myself, probably have bad intuitions about Solomonoff Induction because we only have experience with a very small subset of possible computations, namely those that can be run economically on modern computers. Perhaps the laws of physics can be implemented much more compactly if we ignored efficiency?

Can you explain why this is the case?

As intuition pump consider the reversible vs irreversible cellular automata. If you pick at random, vast majority will not be reversible. Ditto for the symmetries. (Keep in mind that in Solomonoff probability we feed infinite random tape to the machine. It is no Occam's razor. Elegant simplest deterministic things can be vastly outnumbered by inelegant, even if they are most probable. edit: that is to say you can be more likely to pick something asymmetric even if any particular asymmetric is less likely than symmetric)

Also, when you say "aether" universes are more computationally compact than "relativity" universes, is this before or after taking into account our observations (i.e., are you restricting your attention to universes that fit our observations, or not)?

There can always be a vast conspiracy explaining the observations... ideally if you could simulate whole universe (or multiverse) from big bang to today and pick out the data matching observations or the conspired lying, then maybe it'd work, but the whole exercise of doing physics is that you are embedded within universe you are studying. edit: and that trick won't work if the code eats a lot of random tape.

Is it possible that what you said is true only if we want the laws of physics to run fast on current computers?

I don't think relaxing the fast requirement really helps that much. Consider programming Conway's game of life in Turing machine. Or vice versa. Or the interpreter for general TM on the minimal TM. It gets way worse if you want full rotational symmetry on discrete system.

Of course, maybe one of the small busy beavers is a superintelligence that likes to play with various rules like that. Then I'd be wrong. Can not rule even this possibility out. Kolmogorov/Solomonoff name drop is awesome spice for cooking proven-untestable propositions.

One could argue that second order logic could work better, but this is getting way deep into land of untestable propositions that are even proven untestable, and the appropriate response would be high expectations asian father picture with "why not third order logic?".

edit: also you hit nail on the head on an issue here: i can not be sure that there is no very short way to encode something. You can ask me if I am sure that busy beaver 6 is not *anything*, and I am not sure! I am not sure it is not the god almighty. The proposition that there is a simple way is a statement of faith that can not be disproved any more than existence of god. Also, I feel that there has to be scaling for the computational efficiency in the prior. The more efficient structures can run more minds inside. Or conversely, the less efficient structures take more coding to locate minds inside of them.

Discrete universes introduce an unobserved parameter (a fundamental length) as well as a preferred frame. Abandoning complex numbers for a countable group in QM would also be very difficult. There is a complexity theory for real-valued objects and if Solomonoff induction could be adapted to that context, continuum theories ought to be favored.

Yep. Well, in any case, the point is that the Solomonoff probability is not in any way 'universal prior', you have to pick the machine, then you can't actually compute anything useful because its uncomputable and you'll never get anything non-trivial all the way down to minimum length.

If you go for changing the machine, you could use laws of physics as you know them as the machine, too (then you'll be irreparable sceptic of any new physics). In the context of laws of physics its just the emulation of one universal computation (laws of physics) with another (what ever stuff you pick). We are only discussing this because some highly self-important people heard of Solomonoff induction and Kolmogorov complexity, modelled those with something ill defined and fuzzy (lacking not only sufficient understanding of those things but the understanding of the concepts in which to understand those, and so on several layers deep, as is usually the case when you choose something really complicated without having spent a lot of time studying towards it) and then used it as universal smart sounding word to say (e.g. here or here or here ). Name-dropping and nothing more.

Consider Penrose's "Angular momentum: An approach to combinatorial space-time" (math.ucr.edu/home/baez/penrose/Penrose-AngularMomentum.pdf)

Hence, we have a way of getting hold of the concept of Euclidean angle, starting from a purely combinatorial scheme

...

The central idea is that the system defines the geometry. If you like, you can use the conventional description to fit the thing into the ‘ordinary space-time’ to begin with, but then the geometry you get out is not necessarily the one you put into it

It seems that euclidean space, at least, can be derived as a limiting case from simple combinatorial principles. It is not at all clear that general relativity does not have kolmogorov complexity comparable to the cellular automata of your "aether universes".

Kolmogorov complexity of GR itself (text of GR or something) is irrelevant. Kolmogorov complexity of universe that has the symmetries of GR and rest of physics, is. Combinatorial principles are nice but it boils down to representing state of the universe with cells on tape of linear turing machine.

What does Solomonoff Induction actually say about, for example, whether we live in a creatorless universe that runs on physics?

The entirely depends on your definition of creator. Traditional creators such as the Christian god could potentially have enough explanatory power once properly defined, yet would end up horrendously complex (encoding an entire human-like mind into the primitive natural law) unless you further reduce the construction of the creator to a natural process.

Or the Simulation Argument?

Solomonoff Induction is empirical: unless the laws of our universe are broken by the simulators, Solomonoff Induction says nothing about whether we are in a simulation. If the simulators have not influenced our universe but will in the future, it depends entirely on whether the simulator universe is simpler than ours. If the simulators have already influenced our universe it's more complicated.

People are perfectly fine with fuzzy approximate explanations of phenomena, like Maxwell's equations &c. "Goddidit" is not that different. Trying to get a *full* causal explanation would mean finding bits of Omega. In the end, decision theory is fundamental, and epistemological abstractions like SI are cool but ultimately irrelevant. This whole "encoding a human-like mind" thing doesn't work like you think it does --- you can interpret SI that way and see some cool implications, just remember it's a useless toy model. ...Just sayin'.

"Goddidit" is not that different.

Physics theories import low-complexity mathematical models. "Goddidit" imports complicated human notions of agency. Approximate explanations are fine if we can reason that their implicit complexity is low relative to their explanatory power (a relatively easily satisfied metric, after which competition between theories becomes the key factor).

In Solomonoff Induction, theories that don't explain data must contain that data raw.

Physics theories import low-complexity mathematical models. "Goddidit" imports complicated human notions of agency.

Frankly, I think this idea is attractive but ultimately an error. It is indeed true that to an analytical mind with an interest in physics, mathematics feels a lot less complex, in some sense, than intuitive notions of agency. But no matter how much physics or psychology you know, you *don't have introspective access* to the universal prior --- maybe the prior privileges math over psychology, or maybe it doesn't. All we have is our evidence, often in the form of conclusions drawn from intuitive analyses of what hypotheses have or haven't tended to *bear intellectual or instrumental fruit* in the past --- id est, we're awfully close to talkin' 'bout pragmatics and decision theory here. And yes, mathematical explanations have been surprisingly effective. But if you look at human history, hypotheses that make use of "complicated human notions of agency" have also been pretty damn effective. It's not obvious what notion of complexity would massively privilege the former over the latter, and at any rate, we have no way of knowing, because you can't find the universal prior in your backyard.

It is indeed true that to an analytical mind with an interest in physics, mathematics feels a lot less complex,

We have objective verification of the low complexity of formalized mathematical theories because we can look at the length of their formal description in say, first-order logic.

But no matter how much physics or psychology you know, you don't have introspective access to the universal prior --- maybe the prior privileges math over psychology, or maybe it doesn't.

Are you really suggesting some model of computation based on human ideas might work better than say, lambda calculus for computing Kolmogorov complexity for Solomonoff Induction? I'm not sure how to argue with that but I would appreciate it if you would state it explicitly.

We have objective verification of the low complexity of formalized mathematical theories because we can look at the length of their formal description in say, first-order logic.

Right, and that'll be important if we ever run into aliens that for some reason can't wrap their brains around English, but instead can figure out our category theory notation and so on. Or if we're trying to build an FAI, or collaborate with the aforementioned aliens to build FAI.

I'm not sure how to argue with that but I would appreciate it if you would state it explicitly.

Apologies, inferential distance, and there's a few meta-level points that I think are important to communicate. But there's inferential distance on the meta level too.

Also keep in mind that algorithmic information/probability theory is actually quite hard to interpret correctly --- the basic, intuitive way to read meaning into the math is not quite the way to do it. cousin_it has a post or two correcting some intuitive errors of interpretation.

I found these:

Intuitive Explanation of Solomonoff Induction - lukeprog

Does Solomonoff always win? - cousin_it

K-complexity of everyday things - cousin_it

Solomonoff Induction, by Shane Legg - cousin_it

I would appreciate it if people could link me to more.

Alas, none of those are the relevant ones I think. I'm actually rather busy visiting home, so I can only justify certain comments to myself, but I hope someone provides the links.

For what it's worth, I'm a little skeptical of lukeprog's understanding of SI --- no offense to him meant, it's just I so happen to believe he made a rather big error when interpreting the math. On the other hand, cousin_it seems to be really on the ball here. Those are just my impressions; I'm a pretend philosopher, not a compsci dude. At any rate I think it'd be just dandy for cousin_it to check Luke's posts and share his impression or critiques.

Here's one I was thinking of:

The prior of a hypothesis does not depend on its complexity - cousin_it

(If I recall, Nesov's comment clearly demonstrates the important point.)

http://lesswrong.com/lw/328/description_complexity_an_apology_and_note_on/

we can adopt the general rule that mentioning K-complexity in a discussion of physics is always a sign of confusion :-)

Mentioning it anywhere except algorithmic information theory is a sign of confusion. This includes theology and parapsychology. Use just Bayes or, if you want to be all fancy, updateless-like decision theories. I love algorithmic probability to death but it's just not something you should use casually. Too many pitfalls.

Use just Bayes

Bayes requires a prior.

No one should ever need to discuss "priors". Focus on the likelihood ratio.

People are perfectly fine with fuzzy approximate explanations of phenomena, like Maxwell's equations &c.

Approximate explanations have some predictive power.

"Goddidit" is not that different.

What experience do you expect if "Goddidit", as opposed to if "Goddidntdoit"?

(Skeletons of angels versus skeletons of dinosaurs? People with supernatural powers versus people working with superstition? Benevolent universe versus indifferent universe?)

If in your heart you believe you already know, or if in your heart you do not wish to know, then your questioning will be purposeless and your skills without direction.

—Twelve Virtues of Rationality

It's just, I'm having an amazing time back home, and my time is limited. I don't know your goals, but you might want to try harder to signal that you're really curious and not just asking questions that you think are rhetorical. When you reference common knowledge 'round these parts, like Eliezer's posts, you should expect that the other person is already aware of that knowledge, and that they have real, substantive reasons to think that what they said is not entirely refuted by the contents of said common knowledge.

Of course, asking rhetorical questions is a perfectly decent way to make an *argument*. It's just that arguments in that sense aren't quite what's called for in situations like these, I think. But that might just be a difference in our epistemic styles, especially if you're Slavic. (Gasp, *racism!* ;P )

When you reference common knowledge 'round these parts, like Eliezer's posts, you should expect that the other person is already aware of that knowledge

Good point.

Also good point about time being limited, so...

If you'd someday later feel like writing a LW article about similarities between "Goddidit" and Maxwell's equations, or something like that, I will read it.

**[deleted]**· 2012-08-24T08:05:56.405Z · score: -2 (2 votes) · LW(p) · GW(p)

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

In the Kolmogorov-Chaitin Minimum description length approach, the subject must pick a Turing machine whose operations describe the basic operations believed to represent 'simplicity' by the subject.

Kant postulated that all knowledge relies on synthetic, a priori laws of nature, like causality and substance. The problem, then, is how this is possible. Kant's solution was to reason that the subject must supply laws that make experience of objects possible, and that these laws are the synthetic, a priori laws of nature that we know apply to all objects before we experience them.

Kants apriori categories could be equivalent to the 'basic operations' of the aforementioned Turing machine. In the language of computer science, this would be equivalent to specifying an upper ontology defining all the fundamental ontological primatives in the domain 'reality'.