The Presumptuous Philosopher, self-locating information, and Solomonoff induction

post by Charlie Steiner · 2020-05-31T16:35:48.837Z · score: 40 (13 votes) · LW · GW · 26 comments

Contents

  Background:
  Main event:
  Questions:
None
26 comments

I need help working some ideas out.

Background:

If you're not familiar with the Presumptuous Philosopher, it's a thought experiment due to Bostrom about predicting the laws of the universe based on anthropic reasoning:

It is the year 2100 and physicists have narrowed down the search for a theory of everything to only two remaining plausible candidate theories: T1 and T2 (using considerations from super-duper symmetry). According to T1 the world is very, very big but finite and there are a total of a trillion trillion observers in the cosmos. According to T2, the world is very, very, very big but finite and there are a trillion trillion trillion observers. The super-duper symmetry considerations are indifferent between these two theories.
Physicists are preparing a simple experiment that will falsify one of the theories. Enter the presumptuous philosopher: “Hey guys, it is completely unnecessary for you to do the experiment, because I can already show you that T2 is about a trillion times more likely to be true than T1!”

If you're not familiar with Solomonoff induction, the relevant aspects are that it tries to predict your sequence of observations (not the laws of physics, at least not directly), and that it assigns probability to different sequences that decreases exponentially in how complicated those sequences of observations are. By "complicated" we don't merely mean how verbose the laws of physics are, we also have to include how complicated it is to go from the laws of physics to specifying your specific observations.

Main event:

Imagine that our Presumptuous Philosopher has just explained to the physicists that they already know that T2 is a trillion times more likely, because it has more observers in it.

But one of the engineers present is a Solomonoff inductor, and they ask "So does this mean that I should expect the experiment to show us T2, even though the theories are of similar complexity? How does the number of observers change the probability, if all I care about is how complicated it is to predict my sequence of observations?"

"Excellent question," says the Presumptuous Philosopher. "Think of T1 and T2 not as two different hypotheses, but as two different collections of hypothetical ways to predict your observations, with one actual hypothesis for each copy of you in the theory. If there are a trillion times more 'you's in T2, then there are a trillion times as many ways to predict your sequence of observations, which should be treated to a fair share of the probability."

"Okay, so you're saying the actual hypotheses that predict my observations, which I should assign probability to according to their complexity, are things like 'T1 and I'm person #1' or 'T2 and I'm person #10^10'?" says the Solomonoff inductor.

"Exactly."

"But I'm still confused. Because it still requires information to say that I'm person #1 or person #10^10. Even if we assume that it's equally easy to specify where a person is in both theories, it just plain old takes more bits to say 10^10 than it does to say 1."

"Uhh..."

"The amount of bits it takes to just say the number of the person adds log(n) bits to the length of the hypothesis. And so when I evaluate how likely that hypothesis is, which decreases exponentially as you add more bits, it turns out to actually be n times less likely. If we treat T1 and T2 as two different collections of numbered hypotheses, and add up all the probability in the collections, they don't add linearly, just because of the bits it takes to do the numbering. Instead, they add like the (discrete) integral of 1/n, which brings us back to log(n) scaling!" the Solomonoff inductor exclaims.

"Are you sure that's right?" says the Presumptuous philosopher. "If there was exactly 1 of you in T1, and 1 trillion of you in T2, then you're saying that the ratio of probabilities should actually be log(a trillion) to one, or only about 28 times more likely! Does that really make sense, that if there are trillion of you who are going to see the experiment turn out one way, and only one of you who is going to see it turn out the other way, the correct probability ratio is.... 28?"

"Well, when you put it that way, it does sound a little..."

"And what if there were 10 of you in T1 and 10 trillion in T2? When you multiply both populations by a constant, you only add to the log, and so the ratio changes! Now we're supposed to believe that the probability ratio is more like (28+log(10))/(1+log(10)) = 8. Eight?! By the time we get up to a trillion of you in T1 and a trillion trillion in T2, this would imply that T2 is only twice as likely, despite still having a trillion times more copies of you than T1."

Questions:

So, I'm really not sure what conclusion to draw from this. Do we hold that there's an important symmetry when we multiply the sizes of the universes, and therefore Solomonoff induction has to be modified? Or do we trust that Solomonoff induction is pretty theoretically sound, and reject the paradigm of bundling hypotheses about sequences of observations into "universes" that we treat as having symmetries?

Or maybe neither, and I've posed the question badly, or misinterpreted how Solomonoff induction applies here. But if so, I'd really like to know how.

26 comments

Comments sorted by top scores.

comment by johnswentworth · 2020-06-01T02:31:15.365Z · score: 12 (3 votes) · LW(p) · GW(p)

"Okay, so you're saying the actual hypotheses that predict my observations, which I should assign probability to according to their complexity, are things like 'T1 and I'm person #1' or 'T2 and I'm person #10^10'?" says the Solomonoff inductor.

"Exactly."

"But I'm still confused. Because it still requires information to say that I'm person #1 or person #10^10. Even if we assume that it's equally easy to specify where a person is in both theories, it just plain old takes more bits to say 10^10 than it does to say 1."

I think this section is confused about how the question "T1 or T2" gets encoded for a Solomonoff inductor.

Given the first chunk in the quote above, we don't have two world models; we have one world model for each person in T1, plus one world model for each person in T2. Our models are (T1 & person 1), (T1 & person 2), ..., (T2 & person 1), .... To decide whether we're in T1 or T2, our Solomonoff inductor will compare the total probability of all the T1 hypotheses to the total probability of all the T2 hypotheses.

Assuming T1 and T2 have exactly the same complexity, then presumably (T1 & person N) should have roughly the same complexity as (T2 & person N). That is not necessarily the case; T1/T2 may contain information which makes encoding some numbers cheaper/more expensive. But it does seem like a reasonable approximation for building intuition.

Anyway, point is, "it just plain old takes more bits to say 10^10 than it does to say 1" isn't relevant here. There's no particular reason to compare the two hypotheses (T1 & person 1) vs (T2 & person 10^10); that is not the correct formalization of the T1 vs T2 question.

comment by Charlie Steiner · 2020-06-01T21:33:31.167Z · score: 4 (2 votes) · LW(p) · GW(p)

I'm not really sure what you're arguing for. Yes, I've elided some details of the derivation of average-case complexity of bridging laws (which has gotten me into a few factors of two worth of trouble, as Donald Hobson points out), but it really does boil down to the sort of calculation I sketch in the paragraphs directly after the part you quote. Rather than just saying "ah, here's where it goes wrong" by quoting the non-numerical exposition, could you explain what conclusions you're led to instead?

comment by johnswentworth · 2020-06-02T03:14:11.386Z · score: 2 (1 votes) · LW(p) · GW(p)

Oh I see. The quoted section seemed confused enough that I didn't read the following paragraph closely, but the following paragraph had the basically-correct treatment. My apologies; I should have read more carefully.

comment by TurnTrout · 2020-05-31T17:46:26.775Z · score: 10 (2 votes) · LW(p) · GW(p)

The way I understand Solomonoff induction, it doesn't seem like the complexity of specifying the observer scales logarithmically with the total number of observers. it's not like there's a big phone book of observers in which locations are recorded. Rather, it should be the complexity of saying "and my camera is here". 

comment by Charlie Steiner · 2020-05-31T23:39:37.125Z · score: 6 (3 votes) · LW(p) · GW(p)

What's the minimum number of bits required to specify "and my camera is here," in such a way that it allows your bridging-law camera to be up to N different places?

In practice I agree that programs won't be able to reach that minimum. But maybe they'll be able to reach it relative to other programs that are also trying to set up the same sorts of bridging laws.

comment by frankybegs · 2020-06-03T00:47:40.131Z · score: 3 (2 votes) · LW(p) · GW(p)

I'm no expert on this sort of thing, but if I understood correctly this seems exactly right. The complexity of the expression of each observer case shouldn't differ based on the assigned position of the observer in an imaginary order. They're all observer 1a, 1b, 1c, surely?

comment by Tao Lin (tao-lin) · 2020-06-02T03:57:40.153Z · score: 1 (1 votes) · LW(p) · GW(p)

If observers are distributed with constant density in 3d space, then adding 3 extra bits to the position of a camera covers 8x the volume, and thus 8x the observers, so the scaling is the same as if you were just numbering the people

comment by interstice · 2020-06-01T20:25:54.217Z · score: 8 (3 votes) · LW(p) · GW(p)

It seems to me that there are (at least) two ways of specifying observers given a physical world-model, and two corresponding ways this would affect anthropics in Solomonoff induction:

  • You could specify their location in space-time. In this case, what matters isn't the number of copies, but rather their density in space, because observers being more sparse in the universe means more bits are needed to pin-point their location.

  • You could specify what this type of observer looks like, run a search for things in the universe matching that description, then pick one off the list. In this case, again what matters is the density of us(observers with the sequence of observations we are trying to predict) among all observers of the same type.

Which of the two methods ends up being the leading contributor to the Solomonoff prior depends on the details of the universe and the type of observer. But either way, I think the Presumptuous Philosopher's argument ends up being rejected: in the 'searching' case, it seems like different physical theories shouldn't affect the frequency of different people in the universe, and in the 'location' case, it seems that any physical theory compatible with local observations shouldn't be able to affect the density much, because we would perceive any copies that were close enough.

comment by Charlie Steiner · 2020-06-01T22:34:30.940Z · score: 2 (1 votes) · LW(p) · GW(p)

Yeah, the log(n) is only the absolute minimum. If you're specifying yourself mostly by location, then for there to be n different locations you need at least log(n) bits on average (but in practice more), for example.

But I think it's plausible that the details can be elided when comparing two very similar theories - if the details of the bridging laws are basically the same and we only care about the difference in complexity, that difference might be about log(n).

comment by interstice · 2020-06-02T01:12:57.396Z · score: 7 (2 votes) · LW(p) · GW(p)

Another thing, I don't think Solomonoff Induction would give an advantage of log(n) to theories with n observers. In the post you mention taking the discrete integral of to get log scaling, but this seems to be based on the plain Kolmogorov complexity , for which is approximately an upper bound. Solomonoff induction uses prefix complexity , and the discrete integral of converges to a constant. This means having more copies in the universe can give you at most a constant advantage.

(Based on reading some other comments it sounds like you might already know this. In any case, it means S.I. is even more anti-PP than implied in the post)

comment by Charlie Steiner · 2020-06-02T14:25:44.943Z · score: 2 (1 votes) · LW(p) · GW(p)

Actually I had forgotten about that :)

comment by jessicata (jessica.liu.taylor) · 2020-05-31T20:12:40.757Z · score: 4 (2 votes) · LW(p) · GW(p)

My understanding is that Solomonoff induction leads to more SSA-like behavior than SIA-like, at least in the limit, so will reject the presumptuous philosopher's argument.

Asserting that there are n people takes at least K(n) bits, so large universe sizes have to get less likely at some point.

comment by johnswentworth · 2020-06-01T02:36:43.551Z · score: 6 (3 votes) · LW(p) · GW(p)

Asserting that there are n people takes at least K(n) bits, so large universe sizes have to get less likely at some point.

The problem setup doesn't necessarily require asserting the existence of n people. It just requires setting up a universe in which n people happen to exist. That could take considerably less than K(n) bits, if person-detection is itself fairly expensive. We could even index directly to the Solomonoff inductor's input data without attempting to recognize any agents; that would circumvent the K(number of people) issue.

comment by jessicata (jessica.liu.taylor) · 2020-06-01T03:39:32.762Z · score: 7 (4 votes) · LW(p) · GW(p)

If there's a constant-length function mapping the universe description to the number of agents in that universe, doesn't that mean K(n) can't be more than the Kolmogorov complexity of the universe by more than that constant length?

If it isn't constant-length, then it seems strange to assume Solomonoff induction would posit a large objective universe, given that such positing wouldn't help it predict its inputs efficiently (since such prediction requires locating agents).

This still leads to the behavior I'm talking about in the limit; the sum of 1/2^K(n) over all n can be at most 1 so the probabilities on any particular n have to go arbitrarily small in the limit.

comment by TurnTrout · 2020-06-01T15:42:21.033Z · score: 2 (1 votes) · LW(p) · GW(p)

If it isn't constant-length, then it seems strange to assume Solomonoff induction would posit a large objective universe, given that such positing wouldn't help it predict its inputs efficiently (since such prediction requires locating agents).

but a solomonoff ind doesn’t rank hypotheses on whether they allow efficient predictions of some feature of interest, it ranks them based on posterior probabilities (prior probability + to what extent the hypothesis accurately predicted observations so far).

comment by jessicata (jessica.liu.taylor) · 2020-06-01T15:45:47.437Z · score: 2 (1 votes) · LW(p) · GW(p)

I mean efficiently in terms of number of bits, not computation time. Which contributes to posterior probability.

comment by johnswentworth · 2020-06-01T13:59:09.399Z · score: 2 (1 votes) · LW(p) · GW(p)

I'm about 80% on board with that argument.

The main loophole I see is that number-of-embedded-agents may not be decidable. That would make a lot of sense, since embedded-agent-detectors are exactly the sort of thing which would help circumvent diagonalization barriers. That does run into the second part of your argument, but notice that there's no reason we need to detect all the agents using a single program in order for the main problem setup to work. They can be addressed one-by-one, by ad-hoc programs, each encoding one of the hypotheses (world model, agent location).

(Personally, though, I don't expect number-of-embedded-agents to be undecidable, at least for environments with some kind of private random bit sources.)

comment by jessicata (jessica.liu.taylor) · 2020-06-01T14:48:19.817Z · score: 4 (2 votes) · LW(p) · GW(p)

At this point it seems simplest to construct your reference class so as to only contain agents that can be found using the same procedure as yourself. Since you have to be decidable for the hypothesis to predict your observations, all others in your reference class are also decidable.

comment by johnswentworth · 2020-06-01T15:00:29.176Z · score: 4 (2 votes) · LW(p) · GW(p)

Problem is, there isn't necessarily a modular procedure used to identify yourself. It may just be some sort of hard-coded index. A Solomonoff inductor will reason over all possible such indices by reasoning over all programs, and throw out any which turn out to not be consistent with the data. But that behavior is packaged with the inductor, which is not itself a program.

comment by jessicata (jessica.liu.taylor) · 2020-06-01T15:10:04.132Z · score: 4 (2 votes) · LW(p) · GW(p)

Yes, I agree. "Reference class" is a property of some models, not all models.

comment by Charlie Steiner · 2020-06-01T23:07:10.455Z · score: 2 (1 votes) · LW(p) · GW(p)

I am usually opposed on principle to calling something "SSA" as a description of limiting behavior rather than inside-view reasoning, but I know what you mean and yes I agree :P

I am still surprised that everyone is just taking Solomonoff induction at face value here and not arguing for anthropics. I might need to write a follow-up post to defend the Presumptuous Philosopher, because I think there's a real case the Solomonoff induction actually is missing something. I bet I can make it do perverse things in decision problems that involve being copied.

comment by Donald Hobson (donald-hobson) · 2020-05-31T19:24:20.723Z · score: 3 (2 votes) · LW(p) · GW(p)

I trust Solomonoff induction as being pretty theoretically sound. The typical number takes around log(n)+log(log(n)) bits to express, as you have to say how many bits you are using to express the number. Some numbers, like Graham's number can be expressed with far fewer bits. I think that theories are a useful conceptual tool for bundling hypothesis about series of observations, and that T1 and T2 are equally likely.

comment by Charlie Steiner · 2020-06-01T22:35:57.792Z · score: 2 (1 votes) · LW(p) · GW(p)

Good catch - I'm missing some extra factors of 2 (on average).

And gosh, I expected more people defending the anthropics side of the dilemma here.

comment by Bunthut · 2020-06-03T13:48:17.049Z · score: 2 (1 votes) · LW(p) · GW(p)

I've thought about something very similar before, and the conclusion I came to was that the number of copies in a world makes no difference to its likelihood. As far as I can tell, the disagreement is here:

"But I'm still confused. Because it still requires information to say that I'm person #1 or person #10^10. Even if we assume that it's equally easy to specify where a person is in both theories, it just plain old takes more bits to say 10^10 than it does to say 1."
"The amount of bits it takes to just say the number of the person adds log(n) bits to the length of the hypothesis. And so when I evaluate how likely that hypothesis is, which decreases exponentially as you add more bits, it turns out to actually be n times less likely. If we treat T1 and T2 as two different collections of numbered hypotheses, and add up all the probability in the collections, they don't add linearly, just because of the bits it takes to do the numbering. Instead, they add like the (discrete) integral of 1/n, which brings us back to log(n) scaling!"

In T1, there is only one copy of you, so you dont need any bits to specify which one you are. In T2, there is a trillion copies of you, so you need log2(a trillion) ~= 40 bits to specify which one you are. This makes each of those hypothesis times less likely, and since theres a trillion of them, it cancels exactly.

Whereas you seem to think that saying you are #1/1 000 000 000 000 takes fewer bits than saying you are #538 984 236 587/1 000 000 000 000. I can see how you get that idea when you imagine physically writing them out, but the thing that allows you to skip initial 0s in physical writing is that you have non-digit signs that tell you when the number encoding is over. You would need at least one such sign as a terminal character if you wanted to represent them like that in the computer, so you would actually need 3 signs instead of two. I'm pretty sure that comes out worse overall in terms of information theory then taking a set 40 bits.

Or perhaps an easier way to see the problem: If we take your number encoding seriously, it implies that "T2 and I'm person #1" is more likely than "T2 and I'm person #10^10", since it would have fewer bits. But the order in which we number the people is arbitrary. Clearly something has gone wrong upstream from that.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2020-06-02T10:57:54.438Z · score: 2 (1 votes) · LW(p) · GW(p)

I think you are understanding the implications of Solomonoff Induction mostly correctly.

As a philosopher, at first my goal was to prove from first principles what the correct way to update beliefs based on evidence was. Solve the Problem of Induction. I figured some variant of Bayesianism, probably Solomonoff Induction, might do the trick.

Then as I learned more, I figured this would take a while to say the least and that in the meantime it would be good to know what method of belief-updating seemed best, even if we couldn't rigorously prove it.

Then I learned more, particularly about anthropics, and I set my sights even lower, to just have a method of belief-updating that, while still seeming wrong, at least has the property that following it wouldn't drive me insane, i.e. adds up to normality in the broadest and weakest sense.

Even this turned out to be surprisingly difficult to achieve, at least if we also want our method to be interestingly different from "Use your intuition."

Currently solomonoff induction or some variant of it is my favorite.

comment by Pattern · 2020-06-01T20:15:26.912Z · score: 2 (1 votes) · LW(p) · GW(p)

If the philosopher is willing to bet at those odds...