# Friedman's "Prediction vs. Explanation"

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-09-29T06:15:34.000Z · LW · GW · Legacy · 79 comments

We do ten experiments. A scientist observes the results, constructs a theory consistent with them, and uses it to predict the results of the next ten. We do them and the results fit his predictions. A second scientist now constructs a theory consistent with the results of all twenty experiments.

The two theories give different predictions for the next experiment. Which do we believe? Why?

One of the commenters links to *Overcoming Bias*, but as of 11PM on Sep 28th, David's blog's time, no one has given the exact answer that I would have given. It's interesting that a question so basic has received so many answers.

## 79 comments

Comments sorted by oldest first, as this post is from before comment nesting was available (around 2009-02-27).

## comment by Psy-Kosh · 2008-09-29T06:32:28.000Z · LW(p) · GW(p)

Hrm, I'd have to say go with whichever is simpler (choose your favorite reasonable method of measuring the complexity of a hypothesis) for the usual reasons. (Fewer bits to describe it means less stuff that has to be "just so", etc etc... Of course, modify this a bit if one of the hypotheses has a significantly different prior than the other due to previously learned info, but...) But yeah, the less complex one that works is more likely to be closer to the underlying dynamic.

If you're handed the two hypotheses as black boxes, so that you can't actually see inside them and work out which is more complex, then go with the first one. The first one is more likely to be less complex, since at most the first ten data points could have been in some way explicitly hard-coded into it, and it genuinely predicted the next ten. The second one could, potentially, have all twenty data points hard-coded into it in some way, and thus be more complex and effectively less likely to have anything resembling the underlying dynamic encoded in it.

## comment by Christopher_Monsour · 2008-09-29T06:37:56.000Z · LW(p) · GW(p)

Is it cheating to say that it depends hugely on the content of the theories, and their prior probabilities?

Replies from: subod_83

## comment by Peter_de_Blanc · 2008-09-29T06:39:38.000Z · LW(p) · GW(p)

The theories screen off the theorists, so if we knew the theories then we could (given enough cleverness) decide based on the theories themselves what our belief should be.

But before we even look at the theories, you ask me which theory I expect to be correct. I expect the one which was written earlier to be correct. This is not because it matters which theory came first, irrespective of their content; it is because I have different beliefs about what each of the two theories might look like.

The first theorist had less data to work with, and so had less data available to insert into the theory as parameters. This is evidence that the first theory will be smaller than the second theory. I assign greater prior probabilities to small theories than to large theories, so I think the first theory is more likely to be correct than the second one.

## comment by Paul_Gowder2 · 2008-09-29T06:44:00.000Z · LW(p) · GW(p)

I rather like the 3rd answer on his blog (Doug D's). A slight elaboration on that -- one virtue of a scientific theory is its generality, and prediction is a better way of determining generality than explanation -- demanding predictive power from a theory excludes ad hoc theories of the sort Doug D mentioned, that do nothing more than re-state the data. This reasoning, note, does not require any math. :-)

## comment by Paul_Gowder2 · 2008-09-29T06:45:11.000Z · LW(p) · GW(p)

(Noting that the math-ey version of that reason has just been stated by Peter and Psy-kosh.)

## comment by JulianMorrison · 2008-09-29T06:55:52.000Z · LW(p) · GW(p)

The first guy has demonstrated prediction, the second only hindsight. We assume the first theory is right - but of course, we do the next experiment, and then we'll know.

## comment by uu · 2008-09-29T06:59:11.000Z · LW(p) · GW(p)

Assuming both of them can produce values (are formulated in such a way that they can produce a new value with just the past values + the environment)

The second theory has the risk of being more descriptive than predictive. It has more potential of being fit to the input data, including all its noise, and to be a (maybe complex) enumeration of its values.

The first one has at least proven it could be used to predict, while the second one can only produce a new value.

I would thus give more credit to the first theory. At least it has won against ten coin flips without omniscience.

## comment by Anonymous_Coward4 · 2008-09-29T07:12:25.000Z · LW(p) · GW(p)

*Which do we believe?*

What exactly is meant here by 'believe'? I can imagine various interpretations.

a. Which do we believe to be 'a true capturing of an underlying reality'? b. Which do we believe to be 'useful'? c. Which do we prefer, which seems more plausible?

a. Neither. *Real* scientists don't believe in theories, they just test them. Engineers believe in theories :-)

b. Utility depends on what you're trying to do. If you're an economist, then a beautifully complicated post-hoc explanation of 20 experiments may get your next grant more easily than a simple theory that you can't get published.

c. Who developed the theories? Which theory is simpler? (Ptolemy, Copernicus?) Which theory fits in best with other well-supported pre-existing theories? (Creationism, Evolution vs. theories about disease behaviour). Did any unusual data appear in the last 10 experiments that 'fitted' the original theory but hinted towards an even better theory? What is meant by 'consistent' (how well did it fit within error bands, how accurate is it)? Perhaps theory 1 came from Newton, and theory 2 was thought up by Einstein. How similar were the second sets of experiments to the original set?

How easy/difficult were the predictions? In other words, how well did they steer us through 'theory-space'? If theory 1 predicts the sun would come up each day, it's hardly as powerful as theory 2 which suggests the earth rotates around the sun.

What do we mean when we use the word 'constructs'? Perhaps the second theorist blinded himself to half of the results, constructed a theory, then tested it, placing himself in the same position as the original theorist but with the advantage of having tested his theory before proclaiming it to the world? Perhaps the constructor repeated this many times using different subsets of the data to build a predictor and test it; and chose the theorem which was most consistently suggested by the data and verified by subsequent testing.

Perhaps he found that no matter how he sliced and diced and blinded himself to parts of the data, his hand unerringly fell on the same 'piece of paper in the box' (to use the metaphor from the other site).

Another issue is 'how important is the theory'? For certain important theories (development of cancer, space travel, building new types of nuclear reactors etc.), neither 10 nor 20 large experiments might be sufficient for society to confer 'belief' in an engineering sense.

Other social issues may exist. Galileo 'believed' bravely, but perhaps foolishly, depending on how he valued his freedom.

d. Setting aside these other issues, and in the absence of any other information: As a scientist, my attitude would be to believe neither, and test both. As an engineer, my attitude would be to 'prefer' the first theory (if forced to 'believe' only one), and ask a scientist to check out the other one.

## comment by Chiao3 · 2008-09-29T07:29:45.000Z · LW(p) · GW(p)

Both theories fit 20 data points. That some of those are predictions is irrelevant, except for the inferences about theory simplicity that result. Since likelihoods are the same, those priors are also the posteriors.

My state of belief is then represented by a certain probability that each theory is true. If forced to pick one out of the two, I would examine the penalties and payoffs of being correct and wrong, à la Pascal's wager.

## comment by Brian_Scurfield · 2008-09-29T07:31:24.000Z · LW(p) · GW(p)

*We do ten experiments. A scientist observes the results, constructs a theory consistent with them*

Huh? How did the scientist know what to observe without already having a theory? Theories arise as explanations for problems, explanations which yield predictions. When the first ten experiments were conducted, our scientist would therefore be testing predictions arising from an explanation to a problem. He wouldn't just be conducting any old set of experiments.

Similarly the second scientist's theory would be a different explanation of the problem situation, one yielding a different prediction. Before the decisive test, the theory that emerges as the best explanation under the glare of critical scrutiny would be the preferred explanation. Without knowing the problem situation and the explanations that have been advanced it cannot be determined which is to be preferred.

## comment by Ian_C. · 2008-09-29T07:36:55.000Z · LW(p) · GW(p)

One theory has a track record of prediction, and what is being asked for is a prediction, so at first glance I would choose that one. But the explanation-based one is built on more data.

But it is neither prediction nor explanation that makes things happen in the real world, but causality. So I would look into the two theories and pick the one that looks to have identified a real cause instead of simply identifying a statistical pattern in the data.

## comment by Grant · 2008-09-29T08:25:32.000Z · LW(p) · GW(p)

I don't think the second theory is any less "predictive" than the first. It could have been proposed at the same time or before the first, but it wasn't. Why should the predictive ability of a theory vary depending on the point in time in which it was created? David Friedman seems to prefer the first because it demonstrates more ability on the part of the scientist who created it (i.e., he got it after only 10 tries).

Unless we are given any more information on the problem, I think I agree with David.

## comment by Vladimir_Nesov · 2008-09-29T08:53:08.000Z · LW(p) · GW(p)

These theories are evidence about the true distribution of the data, so I construct a new theory based on them. I could then predict the next data point using my new theory and, if I have to play this game, go back and choose whichever of the original theories gives the same prediction, based only on its prediction for this particular next data point, independently of whether the selected theory as a whole is deemed better.

Having more data is strictly better. But I could expect that there is a good chance that a particular scientist will make an error (worse than me now, judging his result, since he himself could think about all of this and, say, construct a theory from the first 11 data points and verify the absence of this systematic error using the rest, or use a reliable methodology). Success of the first theory gives evidence for it, which, depending on my priors, can significantly outweigh the expected improvement from more data points passing through an imperfect procedure for converting them into a theory.

## comment by Tyrrell_McAllister2 · 2008-09-29T08:54:51.000Z · LW(p) · GW(p)

Here's my answer, prior to reading any of the comments here, or on Friedman's blog, or Friedman's own commentary immediately following his statement of the puzzle. So, it may have already been given and/or shot down.

We should believe the first theory. My argument is this. I'll call the first theory T1 and the second theory T2. I'll also assume that both theories made their predictions with certainty. That is, T1 and T2 gave 100% probability to all the predictions that the story attributed to them.

First, it should be noted that the two theories *should* have given the same prediction for the next experiment (experiment 21). This is because T1 *should* have been the best theory that (would have) predicted the first batch. And since T1 also correctly predicted the second batch, it should have been the best theory that would do that, too. (Here, "best" is according to whatever objective metric evaluates theories with respect to a given body of evidence.)

But we are told that T2 makes exactly the same predictions for the first two batches. So it also should have been the best such theory. It should be noted that T2 has no more information with which to improve itself. T1, for all intents and purposes, also knew the outcomes of the second batch of experiments, since it predicted them with 100% certainty. Therefore, the theories *should* have been the best possible given the first two batches. In particular, they should have been equally good.

But if "being the best, given the first two batches" doesn't determine a prediction for experiment 21, then neither of these "best" theories should be predicting the outcome of experiment 21 with certainty. Therefore, since it is given that they *are* making such predictions, they *should* be making the same one.

It follows that at least one of the theories is not the best, given the evidence that it had. That is, at least one of them was constructed using flawed methods. T2 is more likely to be flawed than is T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes's theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right.

Therefore, T1 is more likely to be right than is T2 about the outcome of experiment 21.

## comment by Vladimir_Nesov · 2008-09-29T09:08:42.000Z · LW(p) · GW(p)

(And, of course, the first theory could be improved using the next 10 data points by Bayes' rule, which will give a candidate for being the second theory. This new theory can even disagree with the first on which value of a particular data point is most likely.)

## comment by Kellopyy · 2008-09-29T09:15:49.000Z · LW(p) · GW(p)

Knowing how the theories and experiments were chosen would make this a more sensible problem. Having that information would affect our expectations about the theories: as others have noted, there are a lot of theories one could form in an ad hoc manner, but the question is which of them was selected.

The first theory was selected using the first ten experiments, and it seems to have survived the second set of experiments. If the experiments were independent of the first set and of each other, this would be quite unlikely, so this is strong evidence that the first theory is the connection between the experiments.

Given a reasonable way of choosing theories I would rate both theories as likely, but given finite resources and fallible theorists I would prefer the first theory, as we have evidence that it was chosen sensibly and that the problem is explainable with a theory of its calibre, but only to the extent that I doubt the rationality of the theorist who made the second theory.

## comment by Vijay_Krishnan · 2008-09-29T09:42:14.000Z · LW(p) · GW(p)

I would go with the first one in general. The first one has proved itself on some test data, while all the second one has done is to fit a model on given data. There is always the risk that the second theory has overfitted a model with no worthwhile generalization accuracy. Even if the second theory is simpler than the first the fact that the first theory has been proved right on unseen data makes it a slam dunk winner. Of course further experiments may cause us to update our beliefs, particularly if theory 2 is proving just as accurate.

## comment by Aron · 2008-09-29T09:55:46.000Z · LW(p) · GW(p)

There are an infinite number of models that can predict 10 variables, or 20 for that matter. The only probable way for scientist A to predict a model out of the infinite possible ones is to bring prior knowledge to the table about the nature of that model and the data. This is also true for the second scientist, but only slightly less so.

Therefore, scientist A has demonstrated a higher probability of having valuable prior knowledge.

I don't *think* there is much more to this than that. If the two scientists have equal knowledge there is no reason the second model need be more complicated than the first since the first fully described the extra revealed data in the second.

If it was the same scientist with both sets of data then you would pick the second model.

## comment by Benya_Fallenstein (Benja_Fallenstein) · 2008-09-29T11:05:36.000Z · LW(p) · GW(p)

Tyrrell's argument seems to me to hit the nail on the head. (Although I would have liked to see that formalization -- it seems to me that while T1 will be preferred, the preference *may* be extremely slight, depending. No, I'm too lazy to do it myself :-))

## comment by AnnaSalamon · 2008-09-29T12:14:40.000Z · LW(p) · GW(p)

Formalizing Vijay's answer here:

The short answer is that you should put more of your probability mass on T1's prediction because experts vary, and an expert's past performance is at least somewhat predictive of his future performance.

We need to assume that all else is symmetrical: you had equal priors over the results of the next experiment before you heard the scientists' theories; the scientists were of equal apparent caliber; P( the first twenty experimental results | T1 ) = P( the first twenty experimental results | T2); neither theorist influenced the process by which the next experiment was chosen; etc.

Suppose we have a bag of experts, each of which contains a function for generating theories from data. We draw a first expert from our bag at random and show him data points 1-10; expert 1 generates theory T1. We draw a second expert from our bag at random and show him data points 1-20: expert 2 generates theory T2.

Given the manner in which real human experts vary (some know more than others about a given domain; some aim for accuracy where others aim to support their own political factions; etc.), it is reasonable to suppose that some experts have priors that are well aligned with the problem at hand (or behave as if they do) while others have priors that are poorly aligned. Expert 1 distinguished himself by accurately predicting the results of experiments 11-20 from the results of experiments 1-10; many predictive processes would not have done so well. Expert 2 has only shown an ability to find *some* theory that is consistent with the results of experiments 1-20; *many* predictive processes put a non-zero prior on some such theory that would not have given the results of experiments 11-20 "most expected" status based only on the results from experiments 1-10. We should therefore expect better future performance from Expert 1, all else equal.

The problem at hand is complicated slightly in that we are judging, not experts, but theories, and the two experts generated their theories at different times from different amounts of information. If Expert 1 would have assigned a probability < 1 to results 11-20 (despite producing a theory that predicted those results), Expert 2 is working from more information than Expert 1, which gives Expert 2 at least a slight advantage. Still, given the details of human variability and the fact that Expert 1 did predict results 11-20, I would expect the former consideration to outweigh the latter.
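The bag-of-experts picture can be sketched as a small Monte Carlo simulation. This is only a toy under loudly stated assumptions: the bag contains just two kinds of expert, and the numbers `p_good`, `acc_good`, and `acc_bad` are hypothetical, not taken from the problem. Each prediction is treated as an independent draw given the expert's quality. The sketch also ignores Expert 2's informational advantage and the fact that the two predictions for experiment 21 disagree, so it only illustrates the first consideration above.

```python
import random

def simulate(trials=200_000, p_good=0.5, acc_good=0.9, acc_bad=0.5, seed=0):
    """Estimate how often each expert's theory gets experiment 21 right.

    Hypothetical model: a 'good' expert's theory predicts any unseen result
    correctly with probability acc_good, a 'bad' expert's with acc_bad.
    """
    rng = random.Random(seed)
    hits1 = kept1 = hits2 = 0
    for _ in range(trials):
        # Expert 1 saw results 1-10, so his theory had to predict results 11-20.
        acc1 = acc_good if rng.random() < p_good else acc_bad
        if all(rng.random() < acc1 for _ in range(10)):
            kept1 += 1                      # keep only the worlds matching the story
            hits1 += rng.random() < acc1    # his prediction for experiment 21
        # Expert 2 saw results 1-20; fitting them tells us nothing about his quality.
        acc2 = acc_good if rng.random() < p_good else acc_bad
        hits2 += rng.random() < acc2
    return hits1 / kept1, hits2 / trials

acc_expert1, acc_expert2 = simulate()
print(f"Expert 1: {acc_expert1:.3f}   Expert 2: {acc_expert2:.3f}")
```

Under these assumed numbers, conditioning on ten successful predictions filters out most of the poorly aligned experts, which is where Expert 1's edge comes from.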

## comment by Alex_U · 2008-09-29T12:29:44.000Z · LW(p) · GW(p)

Scientist 2's theory is more susceptible to over-fitting of the data; we have no reason to believe it's particularly generalizable. His theory could, in essence, simply be restating the known results and then giving a more or less random prediction for the next one. Let's make it 100,000 trials rather than 20 (and say that Scientist 1 has based his yet-to-be-falsified theory off the first 50,000 trials), and stipulate that Scientist 2 is a neural network -- then the answer seems clear.
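To make the worry concrete, here is a minimal sketch of a "theory" that merely restates the known results. Everything in it is an assumption for illustration: the data-generating rule `y = 2x + 1`, the helper names `fit_line` and `memorize`, and the arbitrary default guess. Both theories agree with all twenty results, yet they diverge on experiment 21.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through the points (a 'theory' with two parameters)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def memorize(xs, ys, default=0.0):
    """A lookup table: restates every known result, guesses `default` elsewhere."""
    table = dict(zip(xs, ys))
    return lambda x: table.get(x, default)

xs = list(range(1, 21))
ys = [2 * x + 1 for x in xs]            # hypothetical noiseless results of 20 experiments

theory1 = fit_line(xs[:10], ys[:10])    # built from the first ten results only
theory2 = memorize(xs, ys)              # built from all twenty results

# Both theories are consistent with every one of the twenty results...
print(all(theory1(x) == y for x, y in zip(xs, ys)))   # prints True
print(all(theory2(x) == y for x, y in zip(xs, ys)))   # prints True
# ...but they disagree about experiment 21.
print(theory1(21), theory2(21))                       # prints 43.0 0.0
```

Nothing forces a second theory to be a lookup table, of course; the sketch only shows that consistency with all twenty results does not rule it out.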

## comment by Tyrrell_McAllister2 · 2008-09-29T13:01:57.000Z · LW(p) · GW(p)

I wrote in my last comment that "T2 is more likely to be flawed than is T1, because T2 only had to post-dict the second batch. This is trivial to formalize using Bayes's theorem. Roughly speaking, it would have been harder for T1 to have been constructed in a flawed way and still have gotten its predictions for the second batch right."

Benja Fallenstein asked for a formalization of this claim. So here goes :).

Define a *method* to be a map that takes in a batch of evidence and returns a theory. We have two assumptions

ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won't contradict the evidence fed into it. More precisely,

p( M(B) predicts B ) = 1.

(A real account of hypothesis testing would need to be much more careful about what constitutes a "contradiction". For example, it would need to deal with the fact that inputs aren't absolutely reliable in the real world. But I think we can ignore these complications in this problem.)

ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then

p( M(B1) predicts B2 | M is flawed ) < p( M(B1) predicts B2 ).

(Outside of toy problems like this one, we would need to stipulate that B2 is not a logical consequence of B1, and so forth.)

Now, let B1 and B2 be two disjoint and nonempty sets of input data. In the problem, B1 is the set of results of the first ten experiments, and B2 is the set of results of the next ten experiments.

My claim amounted to the following. Let

P1 := p( M is flawed | M(B1) predicts B2 ),

P2 := p( M is flawed | M(B1 union B2) predicts B2 ).

Then P1 < P2.

To prove this, note that, by Bayes's theorem, the second quantity P2 is given by

P2 = p( M(B1 union B2) predicts B2 | M is flawed ) * p(M is flawed) / p( M(B1 union B2) predicts B2 ).

Since p(X) = 1 implies p(X|Y) = 1 when Y is nonempty, Assumption 1 tells us that this reduces to

P2 = p(M is flawed).

On the other hand, the first quantity P1 is

P1 = p( M(B1) predicts B2 | M is flawed ) * p( M is flawed) / p( M(B1) predicts B2 ).

By Assumption 2, this becomes

P1 < p( M is flawed ).

Hence, P1 < P2, as claimed.
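A quick numeric check of the argument, as a sketch only. The prior and the two hit rates below are hypothetical numbers chosen to satisfy Assumption 2; the function name `posterior_flawed` is likewise just for illustration.

```python
from fractions import Fraction

def posterior_flawed(prior_flawed, hit_if_flawed, hit_if_sound):
    """p(M is flawed | M's theory agrees with batch B2), by Bayes's theorem."""
    evidence = prior_flawed * hit_if_flawed + (1 - prior_flawed) * hit_if_sound
    return prior_flawed * hit_if_flawed / evidence

prior = Fraction(1, 2)   # hypothetical prior that the method is flawed

# P1: M(B1) had to predict B2 without seeing it. Assumed hit rates:
# 1/10 for a flawed method, 9/10 for a sound one (so Assumption 2 holds).
p1 = posterior_flawed(prior, Fraction(1, 10), Fraction(9, 10))

# P2: M(B1 union B2) agrees with B2 with probability 1 either way (Assumption 1).
p2 = posterior_flawed(prior, Fraction(1), Fraction(1))

print(p1, p2)   # prints 1/10 1/2
```

With these numbers, successful prediction drives the probability of a flawed method from 1/2 down to 1/10, while mere post-diction leaves it at the prior.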

## comment by Anonymous_Coward4 · 2008-09-29T13:04:08.000Z · LW(p) · GW(p)

Throughout these replies there is a belief that theory 1 is 'correct through skill'. With that in mind it is hard to come to any other conclusion than 'scientist 1 is better'.

Without knowing more about the experiments, we can't determine if theory 1's 10 good predictions were simply 'good luck' or accident.

If your theory is that the next 10 humans you meet will have the same number of arms as they have legs, for example...

There's also potential for survivorship bias here. If the first scientist's results had been 5 correct, 5 wrong, we wouldn't be having this discussion about the quality of their theory-making skills. Without knowing if we are 'picking a lottery winner for this comparison' we can't tell if those ten results are chance or are meaningful predictions.

## comment by Bertil_Hatt · 2008-09-29T13:05:11.000Z · LW(p) · GW(p)

I'd use the only tool we have to sort theories: Occam's razor.

- Weed out all the theories that do not match the experiment — keep both in that case.
- Sort them by how simple they are.

This is what many do by assuming the second is “over-fitted”; I believe a good scientist would search the literature before stating a theory, and know about the first one; as he would also appreciate elegance, I'd expect him to come up with a simpler theory — but, as you pointed out, some time in a economics lab could easily prove me wrong, although I'm assuming the daunting complexity corresponds to plumbing against experiment disproving a previous theory, not the case that we consider here.

In one word: the second (longer references).

The barrel and box analogy hides that simplicity argument, by making all theories a ‘paper’. A stern wag of the finger to anyone who used statistical references, because there aren't enough data to do that.

## comment by AnnaSalamon · 2008-09-29T13:30:52.000Z · LW(p) · GW(p)

Peter, your point that we have different beliefs about the theories prior to looking at them is helpful. AFAICT theories don't screen off theorists, though. My belief that the college baseball team will score at least one point in every game ("theory A"), including the next one ("experiment 21"), may reasonably be increased by a local baseball expert telling me so and by evidence about his expertise. This holds even if I independently know something about baseball.

As to the effect of "number of parameters" on the theories' probabilities, would you bet equally on the two theories if you were told that they contained an identical number of parameters? I wouldn't, given the asymmetric information contained in the two experts vouching for the theories.

Tim, I agree that if you remove the distinct scientists and have the hypotheses produced instead by a single process (drawn from the same bucket), you should prefer whichever prediction has the highest prior probability. Do you mean that the prior probability is equal to the prediction's simplicity or just that simplicity is a good rule of thumb in assigning prior probabilities? If we have some domain knowledge I don't see why simplicity should correspond exactly to our priors; even Solomonoff Inducers move away from their initial notion of simplicity with increased data. (You've studied that math and I haven't; is there a non-trivial updated-from-data notion of "simplicity" that has identical ordinal structure to an updated Solomonoff Inducer's prior?)

Tyrrell, I like your solution a lot. A disagreement anyhow: as you say, if experts 1 and 2 are good probability theorists, T1 will contain the most likely predictions given the experimental results according to Expert 1 and T2 likewise according to Expert 2. Still, if the experts have different starting knowledge and at least one cannot see the other's predictions, I don't see anything that surprising in their "highest probability predictions given the data" calculations disagreeing with one another. This part isn't in disagreement with you, but it is also relevant that if the space of outcomes is small or if experiments 1-20 are part of some local regime that experiment 21 is not (e.g. physics at macroscopic scales, or housing prices before the bubble broke), it may not be surprising to see two theories that agree on a large body of data and diverge elsewhere. Theories that agree in one regime and disagree in others seem relatively common.

Alex, Bertil, and others, I may be wrong, but I think we should taboo "overfitting" and "ad hoc" for this problem and substitute mechanistic, probability-theory-based explanations for where phenomena like "overfitting" come from.

## comment by Benya_Fallenstein (Benja_Fallenstein) · 2008-09-29T14:00:50.000Z · LW(p) · GW(p)

Tyrrell, right, thanks. :) Your formalization makes clear that P1/P2 = p(M(B1) predicts B2 | M flawed) / p(M(B1) predicts B2), which is a stronger result than I thought. Argh, I wish I were able to see this sort of thing immediately.

One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over *actual* observation, whereas in Assumption 1, B ranges over *all possible* observations. :)

Anna, right, I think we need some sort of "other things being equal" proviso to Tyrrell's solution. If experiments 11..20 were chosen by scientist 1, experiment 21 is chosen by scientist 2, and experiments 1..10 were chosen by a third party, and scientist 2 knows scientist 1's theory, for example, we could speculate that scientist 2 has found a strange edge case in 1's formalization that 1 did not expect. I think I was implicitly taking the question to refer to a case where all 21 experiments are of the same sort and chosen independently -- say, lowest temperatures at the magnetic north pole in consecutive years, that sort of thing.

## comment by Tyrrell_McAllister2 · 2008-09-29T14:29:55.000Z · LW(p) · GW(p)

"One small nitpick: It could be more explicit that in Assumption 2, B1 and B2 range over actual observation, whereas in Assumption 1, B ranges over *all possible* observations. :)"

Actually, I implicitly was thinking of the "B" variables as ranging over actual observations (past, present, and future) in both assumptions. But you're right: I definitely should have made that explicit.

## comment by Dennis_Gorelik · 2008-09-29T14:42:58.000Z · LW(p) · GW(p)

We know that the first researcher is able to successfully predict the results of experiment. We don't know that about the second researcher. Therefore I would bet on the first researcher prediction (but only assuming other things being equal).

Then we'll do the experiment and know for sure.

## comment by Tim_Freeman · 2008-09-29T14:45:28.000Z · LW(p) · GW(p)

Benja --

I disagree with Tyrrell (see below), but I can give a version of Tyrrell's "trivial" formalization:

We want to show that:

Averaging over all theories T, P(T makes correct predictions | T passes 10 tests) > P(T makes correct predictions)

By Bayes' rule,

P(T makes correct predictions | T passes 10 tests) = P(T makes correct predictions) * P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests)

So our conclusion is equivalent to:

Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) / P(T passes 10 tests) > 1

which is equivalent to

Averaging over all theories T, P(T passes 10 tests | T makes correct predictions) > P(T passes 10 tests)

which has to be true for any plausible definition of "makes correct predictions". The effect is only small if nearly all theories can pass the 10 tests.

I disagree with Tyrrell's conclusion. I think his fallacy is to work with the undefined concept of "the best theory", and to assume that:

If a theory consistent with past observations makes incorrect predictions then there was something wrong with the process by which that theory was formed. (Not true; making predictions is inherently an unreliable process.)

Therefore we can assume that that process produces bad theories with a fixed frequency. (Not meaningful; the observations made so far are a varying input to the process of forming theories.)

In the math above, the fallacy shows up because the set of theories that are consistent with the first 10 observations is different from the set of theories that are consistent with the first 20 observations, so the initial statement isn't really what we wanted to show. (If that fallacy is a problem with my understanding of Tyrrell's post, he should have done the "trivial" formalization himself.)

There are lots of ways to apply Bayes' Rule, and this wasn't the first one I tried, so I also disagree with Tyrrell's claim that this is trivial.
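For a feel of the size of the effect in the averaged inequality above, here is a rough sketch. It assumes a hypothetical population of theories in which a "correct" theory passes every test, and an incorrect one passes each test independently at a fixed rate; the numbers and the name `posterior_correct` are illustrative only.

```python
def posterior_correct(prior_correct, pass_rate_if_wrong, n_tests=10):
    """P(T makes correct predictions | T passes n_tests), by Bayes' rule.

    Assumes a correct theory passes every test with probability 1 and an
    incorrect theory passes each test independently with pass_rate_if_wrong.
    """
    p_pass_if_wrong = pass_rate_if_wrong ** n_tests
    p_pass = prior_correct + (1 - prior_correct) * p_pass_if_wrong
    return prior_correct / p_pass

# Demanding tests: wrong theories rarely survive all ten, so passing is strong evidence.
hard = posterior_correct(prior_correct=0.01, pass_rate_if_wrong=0.5)
# Easy tests: nearly every theory survives, so passing barely moves the prior.
easy = posterior_correct(prior_correct=0.01, pass_rate_if_wrong=0.99)

print(f"hard tests: {hard:.3f}   easy tests: {easy:.3f}")
```

The second case is the one where nearly all theories can pass the 10 tests, and there the update is small.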

## comment by Tyrrell_McAllister2 · 2008-09-29T15:10:26.000Z · LW(p) · GW(p)

Hi, Anna. I definitely agree with you that two equally-good theories could agree on the results of experiments 1--20 and then disagree about the results of experiment 21. But I don't think that they could both be *best-possible* theories, at least not if you fix a "good" criterion for evaluating theories with respect to given data.

What I was thinking when I claimed that in my original comment was the following:

Suppose that T1 says "result 21 will be X" and theory T2 says "result 21 will be Y".

Then I claim that there is another theory T3, which correctly predicts results 1--20, and which also predicts "result 21 will be Z", where Z is a less-precise description that is satisfied by both X and Y. (E.g., maybe T1 says "the ball will be red", T2 says "the ball will be blue", and T3 says "the ball will be visible".)

So T3 has had the same successful predictions as T1 and T2, but it requires less information to specify (in the Kolmogorov-complexity sense), because it makes a less precise prediction about result 21.

I think that's right, anyway. There's definitely still some hand-waving here. I haven't proved that a theory's being vaguer about result 21 implies that it requires less information to specify. I think it should be true, but I lack the formal information theory to prove it.

But suppose that this can be formalized. Then there is a theory T3 that requires less information to specify than do T1 and T2, and which has performed as well as T1 and T2 on all observations so far. A "good" criterion should judge T3 to be a better theory in this case, so T1 and T2 weren't best-possible.

## comment by Jef_Allbright · 2008-09-29T15:40:11.000Z · LW(p) · GW(p)

Among the many excellent, and some inspiring, contributions to OvercomingBias, this simple post, together with its comments, is by far the most impactful for me. It's scary in almost the same way as the way the general public approaches selection of their elected representatives and leaders.

## comment by Benya_Fallenstein (Benja_Fallenstein) · 2008-09-29T16:02:12.000Z · LW(p) · GW(p)

Tyrrell, um. If "the ball will be visible" is a better theory, then "we will observe some experimental result" would be an *even* better theory?

Solomonoff induction, the induction method based on Kolmogorov complexity, requires the theory (program) to output the precise experimental results of all experiments so far, and in the future. So your T3 would not be a single program; rather, it would be a set of programs, each encoding specifically one experimental outcome consistent with "the ball is visible." (Which gets rid of the problem that "we will observe some experimental result" is the best possible theory :))

## comment by Nominull3 · 2008-09-29T16:17:15.000Z · LW(p) · GW(p)

Here is my answer without looking at the comments or indeed even at the post linked to. I'm working solely from Eliezer's post.

Both theories are supported equally well by the results of the experiments, so the experiments have no bearing on which theory we should prefer. (We can see this by switching theory A with theory B: the experimental results will not change.) Applying bayescraft, then, we should prefer whichever theory was *a priori* more plausible. If we could actually look at the contents of the theory we could make a judgement straight from that, but since we can't we're forced to infer it from the behavior of scientist A and scientist B.

Scientist A only needed ten experimental predictions of theory A borne out before he was willing to propose theory A, whereas scientist B needed twenty predictions of theory B borne out before he was willing to propose theory B. In absence of other information (perhaps scientist B is very shy, or had been sick while the first nineteen experiments were being performed), this suggests that theory B is much less *a priori* plausible than theory A. Therefore, we should put much more weight on the prediction of theory A than that of theory B.

If I'm lucky this post is both right and novel. Here's hoping!

## comment by Phil_Goetz5 · 2008-09-29T16:59:46.000Z · LW(p) · GW(p)

I've seen too many cases of overfitting data to trust the second theory. Trust the validated one more.

The question would be more interesting if we said that the original theory accounted for only some of the new data.

If you know a lot about the space of possible theories and "possible" experimental outcomes, you could try to compute which theory to trust, using (surprise) Bayes' law. If it were the case that the first theory applied to only 9 of the 10 new cases, you might find parameters such that you should trust the new theory more.

In the given case, I don't think there is any way to deduce that you should trust the 2nd theory more, unless you have some a priori measure of a theory's likelihood, such as its complexity.

## comment by Tyrrell_McAllister2 · 2008-09-29T17:22:17.000Z · LW(p) · GW(p)

Benja, I have never studied Solomonoff induction formally. God help me, but I've only read about it on the Internet. It definitely was what I was thinking of as a candidate for evaluating theories given evidence. But since I don't *really* know it in a rigorous way, it might not be suitable for what I wanted in that hand-wavy part of my argument.

However, I don't think I made quite so bad a mistake as highly-ranking the "we will observe some experimental result" theory. At least I didn't make that mistake in my own mind ;). What I actually wrote was certainly vague enough to invite that interpretation. But what I was thinking was more along these lines:

[looks up color spectrum on Wikipedia and juggles numbers to make things work out]

The visible wavelengths are 380 nm -- 750 nm. Within that range, blue is 450 nm -- 495 nm, and red is 620 nm -- 750 nm.

Let f(x) be the decimal expansion of (x - 380nm)/370nm. This moves the visible spectrum into the range [0,1].

I was imagining that T3 ("the ball is visible") was predicting

"The only digit to the left of the decimal point in f(color of ball in nm) is a 0 (without a negative sign)."

while T1 ("the ball is red") predicts

"The only digit to the left of the decimal point in f(color of ball in nm) is a 0 (without a negative sign), and the digit immediately to the right is a 7."

and T2 ("the ball is blue") predicts

"The only digit to the left of the decimal point in f(color of ball in nm) is a 0 (without a negative sign), and the digit immediately to the right is a 2."

So I was really thinking of all the theories T1, T2, and T3 as giving precise predictions. It's just that T3 opted not to make a prediction about something that T1 and T2 did predict.
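The digit-based predictions can be checked with a few lines of Python. This is only a sketch: the sample wavelengths 650 nm and 470 nm are assumed picks inside the red and blue bands quoted above, and other wavelengths in those bands land on neighboring digits (620 nm gives a 6, for instance), so the "7" and "2" are representative rather than exact.

```python
def f(wavelength_nm):
    """Map the visible range 380-750 nm onto [0, 1]."""
    return (wavelength_nm - 380) / 370

def first_decimal_digit(x):
    """Digit immediately to the right of the decimal point, for 0 <= x < 1."""
    return int(x * 10)

def t3_holds(wavelength_nm):
    """T3: the only digit left of the decimal point is 0, with no minus sign."""
    return 0 <= f(wavelength_nm) < 1

# Assumed sample wavelengths, one per color band.
for name, nm in [("red", 650), ("blue", 470)]:
    print(name, t3_holds(nm), first_decimal_digit(f(nm)))
```

Both sample colors satisfy T3, while the first decimal digit tells them apart, which is the sense in which T3 stays silent on a question that T1 and T2 answer.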

However, I definitely take the point that Solomonoff induction might still not be suitable for my purposes. I was supposing that T3 would be a "better" theory by some criterion like Solomonoff induction. (I'm assuming, BTW, that T3 did predict everything that T1 and T2 predicted for the first 20 results. It's only for the 21st result that T3 didn't give an answer as detailed as those of T1 and T2.) But from reading your comment, I guess maybe Solomonoff induction wouldn't even compare T3 to T1 and T2, since T3 doesn't purport to answer all of the same questions.

If so, I think that just means that Solomonoff induction isn't quite general enough. There should be a way to compare two theories even if one of them answers questions that the other doesn't address. In particular, in the case under consideration, T1 and T2 are given to be "equally good" (in some unspecified sense), but they both purport to answer the same question in a different way. To my mind, that *should* mean that each of them isn't *really* justified in choosing its answer over the other. But T3, in a sense, acknowledges that there is no reason to favor one answer over the other. There should be some rigorous sense in which this makes T3 a better theory.

Tim Freeman, I hope to reply to your points soon, but I think I'm at my "recent comments" limit already, so I'll try to get to it tomorrow.

## comment by Kevin_Dick · 2008-09-29T17:33:48.000Z · LW(p) · GW(p)

Upon first reading, I honestly thought this post was either a joke or a semantic trick (e.g., assuming the scientists were themselves perfect Bayesians which would require some "There are blue-eyed people" reasoning).

Because theories that can make accurate forecasts are a small fraction of theories that can make accurate hindcasts, the Bayesian weight has to be on the first guy.

In my mind, I see this visually as the first guy projecting a surface that contains the first 10 observations into the future and it intersecting with the actual future. The second guy just wrapped a surface around his present (which contains the first guy's future). Who says he projected it in the right direction?

But then I'm not as smart as Eliezer and could have missed something.

## comment by Tim_Tyler · 2008-09-29T18:44:41.000Z · LW(p) · GW(p)

We have two theories that explain all the available data - **and** this is *Overcoming Bias* - so how come only a *tiny* number of people have mentioned the possibility of using *Occam's razor*? Surely that **must** be part of any sensible response.

## comment by George_Weinberg2 · 2008-09-29T18:45:36.000Z · LW(p) · GW(p)

I don't think you've given enough information to make a reasonable choice. If the results of all 20 experiments are consistent with both theories but the second theory would not have been made without the data from the second set of experiments, then it stands to reason that the second theory makes more precise predictions.

If the theories are equally complex and the second makes more precise predictions, then it appears to be a better theory. If the second theory contains a bunch of ad hoc parameters to improve the fit, then it's likely a worse theory.

But of course the original question does not say that the second theory makes more precise predictions, nor that it would not have been made without the second set of experiments.

## comment by AnnaSalamon · 2008-09-29T18:55:40.000Z · LW(p) · GW(p)

Hi Tyrrell,

Let T1_21 and T2_21 be the two theories' predictions for the twenty-first experiment.

As you note, if all else is equal, our prior beliefs about P(T1_21) and P(T2_21) -- the odds we would've accepted on bets before hearing T1's and T2's predictions -- are relevant to the probability we should assign after hearing T1's and T2's predictions. It takes more evidence to justify a high-precision or otherwise low-prior-probability prediction. (Of course, by the same token, high precision and otherwise low-prior predictions are often more useful.)

The precision (or more exactly, the prior probability) of the predictions T1 and T2 assign to the *first* twenty experimental results are also relevant. The precision of these tested predictions, however, pulls in the opposite direction: if theory T1 made extremely precise, low-prior-probability predictions *and got them right*, this should more strongly increase our prior probability that T1's set of predictions is entirely true. You can formalize this with Bayes' theorem. [However, the obvious formalization only shows how probability of the conjunction of all of T1's predictions increases; you need a model of how T1 and T2 were generated to know how indicative each theory's track record is of its future predictive accuracy, or how much your beliefs about P(T1_21) specifically should increase. If you replace the scientists with random coin-flip machines, and your prior probability for each event is (1/2), T1's past success shouldn't increase your P(T1_21) belief at all.]
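The coin-flip-machine case can be verified by brute force. The sketch below is an illustration under simplifying assumptions (binary outcomes, a machine that guesses each outcome independently and uniformly); the function name is made up for the example.

```python
from fractions import Fraction
from itertools import product

def p_next_given_track_record(n_past):
    """Exact P(guess n_past+1 is right | first n_past guesses were right)
    for a 'theory' that is just a sequence of fair coin flips."""
    truth = (0,) * (n_past + 1)  # the actual outcomes; any fixed sequence works
    passed = 0
    passed_and_right = 0
    for guess in product((0, 1), repeat=n_past + 1):  # all equally likely
        if guess[:n_past] == truth[:n_past]:
            passed += 1
            if guess[n_past] == truth[n_past]:
                passed_and_right += 1
    return Fraction(passed_and_right, passed)

for n in (0, 5, 10):
    print(n, p_next_given_track_record(n))
```

However long the perfect track record, the conditional probability stays at 1/2, so the record only becomes informative once the model of how theories are generated links past and future accuracy.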

As to whether there is a single "best" metric for evaluating theories, you are right that for any one expert, with one set of starting (prior) beliefs about the world and one set of data with which to update those beliefs, there will be exactly one best (Bayes'-score-maximizing) probability to assign to events T1_21 and T2_21. However, if the two experts are working from non-identical background information (e.g., if one has background knowledge the other lacks), there is no reason to suppose the two experts' probabilities will match even if both are perfect Bayesians. If you want to stick with the Solomonoff formalism, we can make the same point there: a given Solomonoff inducer will indeed have exactly one best (probabilistic) prediction for the next experiment. However, two different Solomonoff inducers, working from two different UTM's and associated priors (or updating to two different sets of observations) may disagree. There is no known way to construct a perfectly canonical notion of "simplicity", "prior probability" or "best" in your sense.

If you want to respond but are afraid of the "recent comments" limit, perhaps email me? We're both friends of Jennifer Mueller's (I think. I'm assuming you're the Tyrrell McAllister she knows?), so between that and our Overcoming Bias intersection I've been meaning to try talking to you sometime. annasalamon at gmail dot com.

Also, have you read A Technical Explanation? It's brilliant on many of these points.

## comment by Will_Pearson · 2008-09-29T18:58:33.000Z · LW(p) · GW(p)

A few thoughts.

I would like the one that:

0) Doesn't violate any useful rules of thumb, e.g. conservation of energy, or the prohibition on transmitting information faster than the speed of light in a vacuum.

1) Gives more precise predictions. To be consistent with a theory isn't hard if the theory gives a large range of uncertainty. E.g. if one theory is

2) Doesn't have any infinities in its range

If all these are equal, I would prefer them equally. Otherwise I would have to think that something was special about the time they were suggested, and be money pumped.

For example: Assume that I was asked this question many times, but my memory was wiped in between times. If I preferred the predicting theory, they could alternate which scientist discovered the theory first, and charge me a small amount of money to get the first guy's theory, but get the explanatory one for free. So I would be forever switching between theories, purely on their temporalness. Which seems a little weird.

## comment by Tom_Breton_(Tehom) · 2008-09-29T20:08:42.000Z · LW(p) · GW(p)

As a machine-learning problem, it would be straightforward: The second learning algorithm (scientist) did it wrong. He's supposed to train on half the data and test on the other half. Instead he trained on all of it and skipped validation. We'd also be able to measure how relatively complex the theories were, but the problem statement doesn't give us that information.
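A toy version of this framing, as a sketch only: the data, the least-squares line standing in for the first scientist, and the lookup table standing in for a second scientist who memorized all 20 points are hypothetical choices, not anything specified in the puzzle.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b; returns a predictor."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    b = my - a * mx
    return lambda x: a * x + b

def fit_lookup(points):
    """Memorizes the data; falls back to the mean for unseen inputs."""
    table = dict(points)
    mean = sum(table.values()) / len(table)
    return lambda x: table.get(x, mean)

def mse(model, points):
    """Mean squared error of a predictor on a list of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Hypothetical data: a linear trend plus a small deterministic wobble.
data = [(x, 2 * x + 1 + (0.3 if x % 2 else -0.3)) for x in range(21)]
train, holdout, future = data[:10], data[10:20], data[20:]

line = fit_line(train)          # trained on 10 points, validated on 10
lookup = fit_lookup(data[:20])  # trained on all 20, never validated

print("line on holdout:", mse(line, holdout))
print("lookup on its own training data:", mse(lookup, data[:20]))
print("line on experiment 21:", mse(line, future))
print("lookup on experiment 21:", mse(lookup, future))
```

The lookup table scores perfectly on the data it was built from, yet that score says nothing about experiment 21; only the line has an error estimate from data it never saw.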

As a human learning problem, it's foggier. The second guy could still have honestly validated his theory against the data, or not. And it's not straightforward to show that one human-readable theory is more complex than another.

But with the information we're given, we don't know anything about that. So ISTM the problem statement has abstracted away those elements, leaving us with learning algorithms done right and done wrong.

## comment by James_D._Miller · 2008-09-29T21:24:45.000Z · LW(p) · GW(p)

We should take into account the costs to a scientist of being wrong. Assume that the first scientist would pay a high price if the second ten data points didn't support his theory. In this case he would only propose the theory if he was confident it was correct. This confidence might come from his intuitive understanding of the theory and so wouldn't be captured by us if we just observed the 20 data points.

In contrast, if there will be no more data the second scientist knows his theory will never be proved wrong.

## comment by James_D._Miller · 2008-09-29T21:46:51.000Z · LW(p) · GW(p)

Sorry, I misread the question. Ignore my last answer.

## comment by steven · 2008-09-29T22:43:00.000Z · LW(p) · GW(p)

Peter de Blanc got it right, IMHO. I don't agree with any of the answers that involve inference about the theorists themselves; they each did only one thing, so it is not the case that you can take one thing they did as evidence for the nature of some other thing they did.

## comment by DonGeddis · 2008-09-29T22:44:00.000Z · LW(p) · GW(p)

Peter de Blanc is right: Theories screen off the theorists. It doesn't matter what data they had, or what process they used to come up with the theory. At the end of the day, you've got twenty data points, and two theories, and you can use your priors in the domain (along with things like Occam's Razor) to compute the likelihoods of the two theories.

But that's not the puzzle. The puzzle doesn't give us the two theories. Hence, strictly speaking, there is no correct answer.

That said, we can start guessing likelihoods for what answer we *would* come up with, if we knew the two theories. And here what is important is that all we know is that both theories are "consistent" with the data they had seen so far. Well, there are an infinite number of consistent theories for any data set, so that's a pretty weak constraint.

Hence people are jumping into the guess that scientist #2 will "overfit" the data, given the extra 10 observations.

But that's not a conclusion you ought to make before seeing the details of the two theories. Either he did overfit the data, or he didn't, but we can't determine that until we see the theories.

So what it comes down to is that the first scientist has less opportunity to overfit the data, since he only saw the first 10 points. And, its successful prediction of the next 10 points is reasonable evidence that theory #1 is on the right track, whereas we have precious little evidence (from the puzzle) about theory #2.

But this doesn't say that theory #1 *is better* than theory #2. It only says that, if we ever had the chance to actually correctly evaluate both theories (using Bayesian priors on both theories and *all* the data), then we currently expect theory #1 will win that battle more often than theory #2.

But that's a weak, kind of indirect, conclusion.

## comment by Alan_Gunn · 2008-09-29T22:51:00.000Z · LW(p) · GW(p)

The short answer is, "it depends." For all we can tell from the statement of the problem, the second "theory" could be "I prayed for divine revelation of the answers and got these 20." Or it could be special relativity in 1905. So I don't think this "puzzle" poses a real question.

## comment by DonGeddis · 2008-09-29T22:52:00.000Z · LW(p) · GW(p)

Oh, and Thomas says: *"There is no way to choose one, except to make another experiment and see which theory - if any (still might be both well or both broken) - will prevail."*

Which leads me to think he is constrained by the Scientific Method, and hasn't yet learned the Way of Bayes.

## comment by steven · 2008-09-29T23:25:00.000Z · LW(p) · GW(p)

Actually I'd like to take back my last comment. To the extent that predictions 11-20 and 21-30 are generated by different independent "parts" of the theory, then the quality of the former part is evidence about the quality of the latter part via the theorist's competence.

## comment by Aron · 2008-09-29T23:33:00.000Z · LW(p) · GW(p)

Of course you can make an inference about the *evidenced* skill of the scientists. Scientist 1 was capable of picking out of a large set of models that covered the first 10 variables, the considerably smaller set of models that also covered the second 10. He did that by reference to principles and knowledge he brought to the table about the nature of inference and the problem domain. The second scientist has not shown any of this capability. I think our prior expectation for the skill of the scientists would be irrelevant, assuming that the prior was at least equal for both of them.

Peter: "The first theorist had less data to work with, and so had less data available to insert into the theory as parameters. This is evidence that the first theory will be smaller than the second theory"

The data is not equivalent to the model parameters. A linear prediction model of [PREDICT_VALUE = CONSTANT * DATA_POINT_SEQUENCE_NUMBER] can model an infinite number of data points. Adding more data points does not increase the model parameters. If there is a model that predicts 10 variables, and subsequently predicts another 10 variables there is no reason to add complexity unless one prefers complexity.
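The point about parameters can be made concrete with the one-parameter model from the comment. This is a minimal sketch: the data (a constant of 3 and sequence numbers starting at 1) are assumed for illustration.

```python
def fit_constant(values):
    """Least-squares CONSTANT for value[n] = CONSTANT * n, with n starting at 1."""
    num = sum(n * v for n, v in enumerate(values, start=1))
    den = sum(n * n for n in range(1, len(values) + 1))
    return num / den

# Hypothetical observations generated with CONSTANT = 3.
observations = [3 * n for n in range(1, 21)]

c10 = fit_constant(observations[:10])  # fitted from the first 10 points
c20 = fit_constant(observations)       # fitted from all 20 points
print(c10, c20)
```

Doubling the data leaves the model with exactly one fitted number, and here the same number, so extra observations do not by themselves force extra complexity.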

## comment by unicorn · 2008-09-30T00:17:00.000Z · LW(p) · GW(p)

Ceteris paribus, I'd choose the second theory since the process that generated it had strictly more information. Assume that the scientists would've generated the same theory given the same data, and the data in question are coin flips. The first scientist sees a random looking series of 10 coin flips with 5 heads and 5 tails and hypothesizes that they are generated by the random flips of a fair coin. We collect 10 more data points, and again we get 5 heads and 5 tails, the maximum likelihood result given the first theory. Now the second scientist sees the same 20 coin flips, and notices that the second series of 10 flips exactly duplicates the first. So the second scientist hypothesizes that the generating process is deterministic with a period of 10 flips. So even though the same 20 data points are maximum likelihood given both theories, the second theory assigns them more probability mass. I think this becomes more salient intuitively if we imagine increasing the length of the repeating series to, say, 1,000,000.
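The probability-mass claim can be checked directly. In this sketch the particular 10-flip pattern is made up, and the periodic theory is given a uniform prior over all 2^10 repeating patterns, which is an assumption the comment does not spell out.

```python
from fractions import Fraction

def likelihood_fair(flips):
    """P(flips | independent fair coin)."""
    return Fraction(1, 2) ** len(flips)

def likelihood_periodic(flips, period=10):
    """P(flips | deterministic pattern repeating every `period` flips),
    with a uniform prior over the 2**period possible patterns (assumed)."""
    for i in range(period, len(flips)):
        if flips[i] != flips[i - period]:
            return Fraction(0)
    return Fraction(1, 2) ** min(period, len(flips))

first_ten = "HTTHHTHTTH"  # hypothetical pattern with 5 heads and 5 tails
flips = first_ten * 2     # the second 10 flips duplicate the first 10

ratio = likelihood_periodic(flips) / likelihood_fair(flips)
print(ratio)  # 1024
```

Under these assumptions the periodic theory gives the observed 20 flips 2^10 times more probability than the fair-coin theory, and the factor doubles with every further flip that keeps repeating.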

## comment by Thanatos_Savehn · 2008-09-30T00:40:00.000Z · LW(p) · GW(p)

Experience alone leads me to pick Theory #2. In what I do I'm constantly battling academic experts peddling Theory #1. Typically they have looked at say 10 epidemiological studies and concluded that the theory "A causes B" is consistent with the data and thus true. A thousand lawsuits against the maker of "A" are then launched on behalf of those who suffer from "B".

Eventually, and almost invariably with admittedly a few notable exceptions, the molecular people then come along and more convincingly theorize that "C causes A and B" such that the reason plaintiff was say ingesting "A" and then suffering from "B" was because "C" was producing an urge/need for "A" as an early symptom of the eventual "B". Or, to take a more concrete example, it may be that they demonstrate that plaintiffs were exposed to "A" (e.g. a vaccine) at the typical age of onset (e.g. autism) of "B" so that the perceived causal connection was merely coincidental.

The ability to identify and control for confounders is for some reason (perhaps to do with overcoming bias) heightened in the second set of eyes to review data and the theories they generate.

## comment by steven · 2008-09-30T00:47:00.000Z · LW(p) · GW(p)

I wrote:

*To the extent that predictions 11-20 and 21 are generated by different independent "parts" of the theory, the quality of the former part is evidence about the quality of the latter part via the theorist's competence.*

...however, this is much less true of cases like Newton or GR where you can't change a small part of the theory without changing all the predictions, than it is of cases like "evolution theory is true and by the way general relativity is also true", which is really two theories, or cases like "Newton is true on weekdays and GR on weekends", which is a bad theory.

So I think that to first order, Peter's answer is still right; and moreover, I think it can be restated from Occam to Bayes as follows:

Experiments 11-20 have given the late theorizer more information on *what false theories are consistent with the evidence*, but they have not given the early theorizer any usable information on what false theories are consistent with the evidence. Experiments 11-20 have also given the late theorizer more information on *what theories are consistent with the evidence*, but this does not help the late theorizer relative to the early theorizer, whose theory after all turned out to be consistent with the evidence. So experiments 11-20 made it more likely for a random false late theory to be consistent with the evidence, relative to a random false early theory; but they did not make it more likely for a random late theory to be consistent with the evidence, relative to the early theory that was put forward. Therefore, according to some Bayes math that I'm too lazy to do, it must be the case that there are more false theories among late theories consistent with the evidence, than among early theories consistent with the evidence.

Does this make sense? I think I will let it stand as my final answer, with the caveat about theories with independent parts predicting different experiments, in which case our new information about the theorists matters.

## comment by simon2 · 2008-09-30T01:56:00.000Z · LW(p) · GW(p)

Let's suppose, purely for the sake of argument of course, that the scientists are superrational.

The first scientist chose the most probable theory given the 10 experiments. If the predictions are 100% certain then it will still be the most probable after 10 more successful experiments. So, since the second scientist chose a different theory, there is uncertainty and the other theory assigned an even higher probability to these outcomes.

In reality people are bad at assessing priors (hindsight bias), leading to overfitting. But these scientists are assumed to have assessed the priors correctly, and given this assumption you should believe the second explanation.

Of course, given more realistic scientists, overfitting may be likely.

## comment by stubydoo · 2008-09-30T02:26:00.000Z · LW(p) · GW(p)

The first theorist had multiple theories to choose from that would have been consistent with the first 10 data points - some of them better than others. Later evidence indicates that he chose well, that he apparently has some kind of skill in choosing good theories. No such evidence is available regarding the skill of the second theorist.

## comment by Stephan · 2008-09-30T07:11:00.000Z · LW(p) · GW(p)

My approach: (using Bayes' Theorem explicitly)

A: first theory

B: second theory

D: data accumulated between the 10th and 20th trials

We're interested in the ratio P(A|D)/P(B|D).

By Bayes' Theorem:

P(A|D) = P(D|A)*P(A)/P(D)
P(B|D) = P(D|B)*P(B)/P(D)

Then

P(A|D)/P(B|D) = P(D|A)*P(A)/(P(D|B)*P(B)).

If each theory predicts the data observed with equal likelihood (that is, under neither theory is the data more likely to be observed), then P(D|A) = P(D|B) so we can simplify,

P(A|D)/P(B|D) = P(A)/P(B) >> 1

given that presumably theory A was a much more plausible prior hypothesis than theory B. Accordingly we have P(A|D) >> P(B|D), so we should prefer the first theory.

In practice, these assumptions may not be warranted. In that case, we have to balance the likelihood of the priors (as best we can guess) against how well the theories predict the observed data (as we should be able to estimate directly from the theories).
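The ratio above is easy to play with numerically. The function below is a sketch of the same formula, and every number passed to it is a hypothetical stand-in, since the puzzle supplies neither priors nor likelihoods.

```python
from fractions import Fraction

def posterior_odds(prior_a, prior_b, like_a, like_b):
    """P(A|D)/P(B|D) = P(D|A)*P(A) / (P(D|B)*P(B)); P(D) cancels."""
    return (like_a * prior_a) / (like_b * prior_b)

# Equal likelihoods: the odds reduce to the prior ratio P(A)/P(B).
equal = posterior_odds(Fraction(1, 5), Fraction(1, 100),
                       Fraction(1, 2), Fraction(1, 2))

# If B fits the data 50 times better, the same priors now favor B.
unequal = posterior_odds(Fraction(1, 5), Fraction(1, 100),
                         Fraction(1, 100), Fraction(1, 2))

print(equal, unequal)  # 20 2/5
```

The second call shows the balancing act: a large enough likelihood advantage for B can outweigh A's head start in prior plausibility.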

## comment by Tim_Tyler · 2008-09-30T07:40:00.000Z · LW(p) · GW(p)

*If you're handed the two hypothesies as black boxes, so that you can't actually see inside them and work out which is more complex, then go with the first one.*

...unless you are attending a magic show. Fortunately, it is not common for scientists to be asked to choose between hypotheses without even knowing what they are.

## comment by anonymous3 · 2008-09-30T09:01:00.000Z · LW(p) · GW(p)

Suppose the scientists S_10 and S_20 are fitting curves f(i) to noisy observations y(i) at points i = 0...20. Suppose there are two families of models, a polynomial g(i;**a**) and a trigonometric h(i;ω,φ):

g(i) <- sum(a[k]*i^k, k=0..infinity)
h(i) <- cos(ω*i+φ)

The angular frequency ω is predetermined. The phase φ is random:

φ ~ Flat(), equivalently φ ~ Uniform(0, 2*π)

The coefficients a[k] are independently normally distributed with moments matched to the marginal moments of the coefficients in the Taylor expansion of h(i):

a[k] ~ Normal(mean=0, stddev=(ω^k)/(sqrt(2)*factorial(k)))

There is some probability q that the true curve f(i) is generated by the trigonometric model h(i), and otherwise f(i) is generated by the polynomial model g(i):

isTrigonometric ~ Bernoulli(q)

f(i) <- if(isTrigonometric, then_val=h(i), else_val=g(i))

Noise is iid Gaussian:

n[i] ~ Normal(mean=0, stddev=σ)

y[i] <- f(i) + n[i]

(The notation has been abused to use i as an index in n[i] and y[i] because each point i is sampled at most once.)
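For anyone who wants to experiment, here is a rough sampler for the generative model described so far, writing the angular frequency as `omega`, the phase as `phi`, and the noise standard deviation as `sigma`. It is a sketch, not a solution to the question posed: the infinite polynomial sum is truncated at `max_degree` terms (an assumption made so the code terminates), and the function names and parameter values are invented for the example.

```python
import math
import random

def sample_curve(rng, omega, q, max_degree=30):
    """Draw the true curve f: trigonometric with probability q, else polynomial."""
    if rng.random() < q:
        phi = rng.uniform(0, 2 * math.pi)  # random phase
        return True, lambda i: math.cos(omega * i + phi)
    # Polynomial coefficients with the moment-matched standard deviations.
    # Truncating at max_degree is an assumption, not part of the model.
    coeffs = [
        rng.gauss(0, omega ** k / (math.sqrt(2) * math.factorial(k)))
        for k in range(max_degree + 1)
    ]
    return False, lambda i: sum(a * i ** k for k, a in enumerate(coeffs))

def sample_observations(rng, omega, sigma, q, n_points=21):
    """Return (is_trigonometric, [y[0], ..., y[n_points-1]]) with iid Gaussian noise."""
    is_trig, f = sample_curve(rng, omega, q)
    return is_trig, [f(i) + rng.gauss(0, sigma) for i in range(n_points)]

rng = random.Random(0)  # seeded so runs are repeatable
is_trig, ys = sample_observations(rng, omega=0.5, sigma=0.1, q=0.5)
print(is_trig, [round(y, 2) for y in ys[:5]])
```

Drawing many such data sets and conditioning on each scientist's fit being "consistent" would be one way to estimate a good mixing weight empirically, though that step is not attempted here.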

Scientists S_10 and S_20 were randomly chosen from a pool of scientists S_j having different beliefs about q. A known fraction s of the scientists in the pool understand that the trigonometric model is possible. Their belief q_{S_j} about the value of q for this problem is that q is equal to v. The remaining scientists do not understand that the trigonometric model is possible, and resort to polynomial approximations to predict everything. Their belief q_{S_j} about the value of q for this problem is that q equals 0:

understandsTrigonometricModel(S_j) ~ Bernoulli(s)
q_{S_j} <- if(understandsTrigonometricModel(S_j), then_val=v, else_val=0)

(As a variation, the scientists can have beta-distributed beliefs q_{S_j} ~ Beta(α, β).)

Both scientists report their posterior means for f(i) conditional on their knowledge. S_10 knows y[i] for i=0...9 and S_20 knows y[i] for i=0...19. Both scientists are Bayesians and know the probabilistic structure of the problem and the values of ω and σ. Both scientists also predict posterior means for f(20), and therefore for the observable y(20).

You are given the values of ω, σ, q, s, and v and the fact that, for each scientist, the mean of the squared differences between the posterior means for f(i) and the observations y[i] is less than σ^2 ("the theory is consistent with the experiments"). You are not given the values y[i]. (You are also not given any information about any of the scientists' predictive densities at **y**, conditional or not, which is maddening if you're a Bayesian.) You are asked to choose a mixing coefficient t to combine the two scientists' predictions for y[20] into a mixed prediction y_t[20]:

y_t[20] <- t*y_{S_10}[20] + (1-t)*y_{S_20}[20]

Your goal in choosing t is to minimize the expectation of the squared error (y_t[20]-y[20])^2. For some example values of ω, σ, q, s, and v, what are the optimal values of t?

(In the variation with beta-distributed q_{S_j}, the optimal t depends on α and β and not on s and v.)

Note that if σ is small, ω is not small, q is not small, s is not small, and v is not small, then the given information implies with very high probability that isTrigonometric==True, that the first scientist understands that the trigonometric model is possible, and that the first scientist's posterior belief that the trigonometric model is correct is very high. (If the polynomial model had been correct, the first scientist's narrow prediction of y[10]...y[19] would have been improbable.) What happens when s is high, so that the second scientist is likely to agree? Would S_20 then be a better predictor than S_10?

In this formulation the scientists are making predictive distributions, which are not what most people mean by "theories". How do you draw the line between a predictive distribution and a theory? When people in this thread use the words "single best theory", what does that even mean? Even the Standard Model and General Relativity use constants which are only known from measurements up to an approximate multivariate Gaussian posterior distribution. Anyone who uses these physical theories to predict the "ideal" outcomes of experiments which measure physical constants must predict a distribution of outcomes, not a point. Does this mean they are using a "distribution over physical theories" and not a "single best physical theory"? Why do we even care about that distinction?

## comment by Stuart_Armstrong · 2008-09-30T10:03:00.000Z · LW(p) · GW(p)

Everything else being equal, go for the first theory.

## comment by Benya_Fallenstein (Benja_Fallenstein) · 2008-09-30T12:28:00.000Z · LW(p) · GW(p)

Two points I'd like to comment on.

Re: *The second scientist had more information*

I don't think this is relevant if-- as I understood from the description-- the first scientist's theory predicted experiments 11..20 with high accuracy. In this scenario, *I don't think the first scientist should have learned anything that would make them reject their previous view*. This seems like an important point. (I think I understood this from Tyrrell's comment.)

Re: *Theories screen off theorists*

I agree-- we should pick the simpler theory-- *if* we're able to judge them for simplicity, and one is the clear winner. This may not be easy. (To judge General Relativity to be appropriately simple, we may have to be familiar with the discussion around symmetries in physics, not just with the formulas of GR, for example...)

I understood Tyrrell to say that both of the scientists are imperfect Bayesian reasoners, and so are we. If we were perfect Bayesians, both scientists and we would look at the data and immediately make the same prediction about the next trial. In practice, all three of us use some large blob of heuristics. Each such blob of heuristics is going to have a bias, and we want to pick the one that has the smaller expected bias. (If we formalize theories as functions from experiments to probability distributions of results, I think the "bias" would naturally be the Kullback-Leibler divergence between the theory and the true distribution.) Using Tyrrell's argument, it seems we can show that the first scientist's bias is likely to be smaller than the second scientist's bias (other things being equal).
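To make the "bias as Kullback-Leibler divergence" idea concrete, here is a minimal sketch for a single experiment with discrete outcomes. The direction D_KL(true || theory), the use of bits, and the example distributions are assumptions chosen for illustration; the comment above does not specify them.

```python
import math


def kl_divergence_bits(true_dist, theory_dist):
    """D_KL(true || theory) in bits for discrete distributions.

    Both arguments map outcome -> probability. Returns infinity if the
    theory assigns zero probability to an outcome that can actually occur.
    """
    total = 0.0
    for outcome, p in true_dist.items():
        if p == 0.0:
            continue  # outcomes that never happen contribute nothing
        q = theory_dist.get(outcome, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log2(p / q)
    return total


if __name__ == "__main__":
    true_dist = {"heads": 0.5, "tails": 0.5}      # hypothetical truth
    theory_a = {"heads": 0.5, "tails": 0.5}       # matches the truth exactly
    theory_b = {"heads": 0.9, "tails": 0.1}       # overconfident theory
    print(kl_divergence_bits(true_dist, theory_a))
    print(kl_divergence_bits(true_dist, theory_b))
```

A theory that matches the true distribution scores zero, and the score grows as the theory becomes more confidently wrong. Of course, in the puzzle we never get to see the true distribution, so this is a quantity we can only reason about in expectation.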

## comment by Richard_Hollerith · 2008-09-30T17:14:00.000Z · LW(p) · GW(p)

The way science is currently done, experimental data that the formulator of the hypothesis did not know about is much stronger evidence for a hypothesis than experimental data he did know about.

A hypothesis formulated by a perfect Bayesian reasoner would not have that property, but hypotheses from human scientists do, and I know of no cost-effective way to stop human scientists from generating the effect. Part of the reason human scientists do it is because the originator of a hypothesis is too optimistic about the hypothesis (and this optimism stems in part from the fact that being known as the originator of a successful hypothesis is very career-enhancing), and part of the reason is because a scientist tends to stop searching for hypotheses once he has one that fits the data (and I believe this has been called motivated stopping on this blog).

Most of the time, these human biases will swamp the other considerations mentioned so far in these comments (except the one mentioned below). Consequently, the hypothesis advanced by Scientist 1 is more probable.

Someone made a very good comment to the effect that Scientist 1 is probably making better use of prior information. It might be the case that that is another way of describing the effect I have described.

## comment by Richard_Hollerith · 2008-09-30T17:18:00.000Z · LW(p) · GW(p)

Whoever (E or Friedman) chose the title, "Prediction vs. Explanation", was probably thinking along the same lines.

## comment by Chris_Wegford · 2008-10-02T13:53:00.000Z · LW(p) · GW(p)

If I were to travel to the North Pole and live there through the months of January and February with no prior knowledge of the area, then I would almost certainly believe (one could even say Theorize) that it is constantly night time at the North Pole. I could move back to the United States, and might never know that my theory is wrong. If I had, however, stayed through March and maybe into April, I would then know that the Sun does eventually rise. From this extra information, I could postulate a new theory that would likely be more correct.

"The Sun rises and sets in months-long cycles at the North Pole" is, subjectively, more complex than "The Sun never rises at the North Pole", and yet the more complex theory, the second one I formed, is more correct.

A Theory based on more information (assuming the experiments were pure and variables were controlled) has to be more accurate. Fear of "Over-fitting" is a bias. The principle can only be applied in hindsight, after the mistakes are already known. Also, it would seem that "Over-fitting" is a product of human error. Since we are given no information on the scientists running the experiments and constructing the theories, we must assume that they are faithful and diligent, and in a word, Perfect.

Occam's Razor is itself a bias. It assumes human error in the form of over-complication via insufficient information. Given the information we have for this puzzle, we cannot use any tool that assumes any such thing. I vote that Occam's Razor sit this one out.

Given only what we have, even if E21 sides with T1 (Say that T1 = A is true, and T2 = A is true except when B. E21 yields A in spite of B.), then we must conclude T3 (A is true except when B, except when C), which will be closer to T2 than to T1.

T1 < T2 < T3 etc.

Now if we were given information on the Theories, Experiments, or Scientists, then it might be a completely different story and Occam's Razor might come off the sidelines. Until then though, I am of the opinion that this is the only logical conclusion to this puzzle using only the information we were given.

## comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-10-02T19:33:00.000Z · LW(p) · GW(p)

Remember the first comment way back in the thread? Psy-Kosh? I'm pretty much with him.

We assume that both hypotheses are equally precise - that they have equally pointed likelihood functions in the vicinity of the data so far.

If you know what's inside the boxes, and it's directly comparable via Occam's Razor, then Occam's Razor should probably take over.

The main caveat on this point is that counting symbols in an equation doesn't always get you the true prior probability of something, and the scientist's ability to predict the next ten symbols from the first ten symbols may suggest that his version of Occam's Razor (that is, his prior probability) is unusually good, if there's a dispute about which Razor or prior to use.

For example, it might be that each experiment only gives you 4 bits of data, and when you write out the first scientist's hypothesis in symbols, it comes out to 60 bits' worth of causal network, or something like that. But it was the first hypothesis the scientist thought of, after seeing only 10 experiments or 40 bits' worth of data, and what's more, it worked. This suggests that the first scientist has a higher effective prior for that hypothesis than the 60-bit Occam measurement of "counting symbols" would have you believe. Direct symbol counting only gives you an upper bound on a hypothesis's complexity, and so a lower bound on its prior probability, not an upper bound.
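The arithmetic behind that example can be sketched as follows. This is only a back-of-the-envelope model, and it rests on assumptions not stated above: the hypothesis predicts every result exactly, the rival is "no pattern" (each 4-bit result equally likely), and a 60-bit description is treated as prior odds of roughly 2^-60. The function name is made up for the illustration.

```python
def log_odds_bits(description_bits, bits_per_experiment, n_confirmed):
    """Log2 odds of the hypothesis against a 'no pattern' rival.

    Assumes prior odds of about 2**-description_bits and that each exactly
    predicted experiment contributes bits_per_experiment bits of evidence.
    """
    return -description_bits + bits_per_experiment * n_confirmed


if __name__ == "__main__":
    # After the first 10 experiments (40 bits), a 60-bit hypothesis is still
    # 20 bits short of even odds under the symbol-counting prior.
    print(log_odds_bits(60, 4, 10))  # -20
    # After all 20 experiments (80 bits) it is 20 bits ahead.
    print(log_odds_bits(60, 4, 20))  # 20
```

On these numbers, a reasoner whose prior really was 2^-60 would have had little business singling out that hypothesis after only 40 bits of data. That the scientist did single it out, and that it then kept working, is the hint that his effective prior for it was better than the symbol count suggests.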

If you don't know what's inside the boxes or you don't have a good Occam prior for it, the first theory wins because the second black box is presumed to have used more of the data.

The main circumstance under which the second theory wins outright, is if you can look inside the boxes and the second theory is *strictly* simpler - that is, it captures all the successes so far, while containing *strictly* fewer elements - not just a shorter description, but a description that is a strict subset of the first. Then we just say that the first theory had a dangling part that needs snipping off, which there was never a reason to hypothesize in the first place.

## comment by Norman_Noman · 2008-10-04T02:59:00.000Z · LW(p) · GW(p)

The question is whether the likelihood that the 21st experiment will validate the best theory constructed from 20 data points and invalidate the best theory constructed from 10 data points, when that theory also fits the other ten, is greater than the likelihood scientist B is just being dumb.

The likelihood of the former is very hard to calculate, but it's definitely less than 1/11; in other words, over 90% of the time (10/11 ≈ 90.9%) the first theory will still be, if not the best possible theory, good enough to predict the results of one more experiment. The likelihood that a random scientist, who has 20 data points and a theory that explains them, will come up with a different theory which is total crap, is easily more than 1 in 10.

Ergo, we trust theory A.

## comment by Norman_Noman · 2008-10-04T03:52:00.000Z · LW(p) · GW(p)

Part of the problem here is that the situation presented is an extremely unusual one. Unless scientist B's theory is deliberately idiotic, experiment 21 has to strike at a point of contention between two theories which otherwise agree, and it has to be the only experiment out of 21 which does so. On top of that, both scientists have to pick one of these theories, and they have to pick different ones. Even if those theories are the only ones which make any sense, and they're equally likely from the available data, your chance of ending up in the situation the problem presents is less than 1/100.

Even if you're given 21 data points that follow a pattern which could be predicted from the first 10, and you have to deliberately come up with a theory that fits the first 20 but not the 21st, it's quite tricky to do so. I would be surprised if anyone could come up with even a single example of the situation presented in this puzzle (or an analogous one with even more experiments) ever occurring in the real world.

Unless experiment 21 is of a different nature than experiments 1-20. A different level of precision, say. Then I'd go with scientist B, because with more data he can make a model that's more precise, and if precision suddenly matters a lot more, it's easy to see how he could be right and A could be wrong.

## comment by Norman_Noman · 2008-10-04T04:10:00.000Z · LW(p) · GW(p)

If instead of ten experiments per set, there were only 3, who here would pick theory B instead?

## comment by subod_83 · 2010-03-29T12:07:37.280Z · LW(p) · GW(p)

Since both theories satisfy all 20 experiments, for all intents and purposes of experimentation the theories are both equally valid or equally invalid.

Replies from: wedrifid

## ↑ comment by wedrifid · 2010-03-29T13:09:46.010Z · LW(p) · GW(p)

My theory after seeing all 20 experiments is:

- The first in the series will be 'A'
- The second in the series will be 'B'
- The third in the series will be 'C'
- The fourth in the series will be 'D'
- The fifth in the series will be 'E'
- The sixth in the series will be 'F'
- The seventh in the series will be 'G'
- The eighth in the series will be 'H'
- The ninth in the series will be 'I'
- The tenth in the series will be 'J'
- The eleventh in the series will be 'K'
- The twelfth in the series will be 'L'
- The thirteenth in the series will be 'M'
- The fourteenth in the series will be 'N'
- The fifteenth in the series will be 'O'
- The sixteenth in the series will be 'P'
- The eighteenth in the series will be 'Q'
- The twentieth in the series will be 'R'
- The twenty first in the series will be a camel.

Those guys who have the theories "Each experiment will give a successive letter of the alphabet" and "Each experiment will give the next ASCII character" may be 'equally valid or invalid' but, well, they lack creativity, don't you think?

## comment by Houshalter · 2013-09-21T22:42:59.083Z · LW(p) · GW(p)

Imagine the ten experiments produced the following numbers as results: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

The first scientist's hypothesis is this function: if n <= 20 then n else 5 (where n is the number of the variable being tested in the experiment)

10 more experiments are done and of course it predicts the answers perfectly. Scientist two comes up with his hypothesis: n. That's it, just the value of n is the value that will be measured by the experiment.

Now, would you really trust the first hypothesis because it happened to have been made before the next ten experimental results were known?
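The two hypotheses can be written out directly. This is a minimal sketch under two assumptions: the next ten results continue as 11 through 20, and the first scientist's cutoff is meant to include experiment 20 (written here as `n <= 20`) so that his rule matches all twenty results. The function names are invented for the example.

```python
def hypothesis_one(n):
    """First scientist: equals n up to experiment 20, then jumps to 5."""
    return n if n <= 20 else 5


def hypothesis_two(n):
    """Second scientist: the measured value is simply n."""
    return n


def fits(hypothesis, data):
    """True if the hypothesis reproduces every recorded result."""
    return all(hypothesis(n) == value for n, value in data.items())


# Assumed results of the twenty experiments: 1, 2, ..., 20.
observed = {n: n for n in range(1, 21)}

if __name__ == "__main__":
    print(fits(hypothesis_one, observed), fits(hypothesis_two, observed))  # True True
    print(hypothesis_one(21), hypothesis_two(21))  # 5 21
```

Both rules fit every recorded result and disagree only on experiment 21, which is exactly the situation in the puzzle. The first rule carries an extra branch that no data point ever exercised.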

In practice it's usually better to choose the hypothesis that has made more successful predictions in the past, because that is evidence that it isn't overfit. But penalizing complexity is also a more general way to keep overfitting in check. It is useful when you can't just perform more experiments to test your hypotheses, or where it would be expensive to do so.