# Beautiful Probability

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-01-14T07:19:47.000Z · LW · GW · Legacy · 120 comments

Should we expect rationality to be, *on some level,* simple? Should we search and hope for *underlying* beauty in the arts of belief and choice?

Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):

"Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data:

n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?" (Presumably the two control groups also had equal results.)

According to old-fashioned statistical procedure - which I believe is still being taught today - the two researchers have performed different experiments with different stopping conditions. The two experiments *could* have terminated with different data, and therefore represent different tests of the hypothesis, requiring different statistical analyses. It's quite possible that the first experiment will be "statistically significant", the second not.
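To make the Old Style claim concrete, here is a minimal sketch of how a p-value can depend on the stopping rule. Jaynes does not pin down the second researcher's rule precisely, so the sketch swaps in a classic alternative, "stop at the 30th failure", which can also end at n = 100, r = 70. The null cure rate of 60% and the one-sided test are likewise assumptions made for illustration.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

NULL_RATE = 0.6  # assumed null hypothesis: the cure rate is 60%

# Design A: n = 100 fixed in advance.
# "As or more extreme" means 70 or more cures out of 100.
p_fixed_n = binom_tail(100, 70, NULL_RATE)

# Design B (assumed, not Jaynes's rule): stop at the 30th failure.
# Needing 100 or more patients means at most 29 failures,
# i.e. at least 70 cures, among the first 99 patients.
p_stop_at_30_failures = binom_tail(99, 70, NULL_RATE)

print(p_fixed_n, p_stop_at_30_failures)
```

Both designs saw identical data, yet the two tail probabilities differ, because each one sums over a different set of outcomes that did not happen.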

Whether or not you are disturbed by this says a good deal about your attitude toward probability theory, and indeed, rationality itself.

Non-Bayesian statisticians might shrug, saying, "Well, not all statistical tools have the same strengths and weaknesses, y'know - a hammer isn't like a screwdriver - and if you apply different statistical tools you may get different results, just like using the same data to compute a linear regression or train a regularized neural network. You've got to use the right tool for the occasion. Life is messy -"

And then there's the Bayesian reply: "Excuse *you*? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts? And you have the nerve to accuse *us* of being 'too subjective'?"

If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher's private intentions. So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments. At least one of the two Old Style methods must discard relevant information - or simply do the wrong calculation - for the two methods to arrive at different answers.
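The point can be checked numerically. In the sketch below, the fixed-n design contributes the usual binomial coefficient, while `PATHS_UNDER_OTHER_RULE` is a made-up placeholder for however many cure/failure sequences the second researcher's rule allows to end at the same data. Whatever that count is, it does not depend on the cure rate, so it cancels out of every likelihood ratio.

```python
from math import comb, isclose

N_PATIENTS, N_CURES = 100, 70
PATHS_UNDER_OTHER_RULE = 12345  # hypothetical constant; its value is irrelevant

def likelihood_fixed_n(theta):
    """P(70 cures in 100 patients | cure rate theta), n fixed in advance."""
    return comb(N_PATIENTS, N_CURES) * theta**N_CURES * (1 - theta)**(N_PATIENTS - N_CURES)

def likelihood_other_rule(theta):
    """Same data under a different stopping rule: only the constant changes."""
    return PATHS_UNDER_OTHER_RULE * theta**N_CURES * (1 - theta)**(N_PATIENTS - N_CURES)

def likelihood_ratio(likelihood, theta_a, theta_b):
    """How strongly the data favor cure rate theta_a over theta_b."""
    return likelihood(theta_a) / likelihood(theta_b)

ratio_fixed = likelihood_ratio(likelihood_fixed_n, 0.7, 0.6)
ratio_other = likelihood_ratio(likelihood_other_rule, 0.7, 0.6)
print(ratio_fixed, ratio_other)
```

The two printed ratios agree up to floating-point rounding, so any prior over the cure rate gets updated to the same posterior in both cases.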

The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I'm not going to try to recount that elder history in this blog post.

But one of the central conflicts is that Bayesians expect probability theory to be... what's the word I'm looking for? "Neat?" "Clean?" "Self-consistent?"

As Jaynes says, the theorems of Bayesian probability are just that, *theorems* in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent - every theorem compatible with every other theorem.

If you want to know the sum of 10 + 10, you can redefine it as (2 * 5) + (7 + 3) or as (2 * (4 + 6)) or use whatever other *legal* tricks you like, but the result always has to come out to be the same, in this case, 20. If it comes out as 20 one way and 19 the other way, then you may conclude you did something illegal on at least one of the two occasions. (In arithmetic, the illegal operation is usually division by zero; in probability theory, it is usually an infinity that was not taken as the limit of a finite process.)

If you get the result 19 = 20, look hard for that error you just made, because it's unlikely that you've sent arithmetic itself up in smoke. If anyone should ever succeed in deriving a *real* contradiction from Bayesian probability theory - like, say, two different evidential impacts from the same experimental method yielding the same results - then the whole edifice goes up in smoke. Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory.

Math! That's the word I was looking for. Bayesians expect probability theory to be *math.* That's why we're interested in Cox's Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.

And yet... *should* rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy - so shouldn't you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It's nice when problems are clean, but they usually aren't, and you have to live with that.

After all, it's a well-known fact that you can't use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?

*That's* the fundamental difference in mindset. Old School statisticians thought in terms of *tools,* tricks to throw at particular problems. Bayesians - at least this Bayesian, though I don't think I'm speaking only for myself - we think in terms of *laws.*

Looking for laws isn't the same as looking for especially neat and pretty tools. The second law of thermodynamics isn't an especially neat and pretty refrigerator.

The Carnot cycle is an ideal engine - in fact, *the* ideal engine. No engine powered by two heat reservoirs can be more efficient than a Carnot engine. As a corollary, all thermodynamically reversible engines operating between the same heat reservoirs are equally efficient.

But, of course, you can't use a Carnot engine to power a real car. A real car's engine bears the same resemblance to a Carnot engine that the car's tires bear to perfect rolling cylinders.

Clearly, then, a Carnot engine is a useless *tool* for building a real-world car. The second law of thermodynamics, obviously, is not applicable here. It's too hard to make an engine that obeys it, in the real world. Just ignore thermodynamics - use whatever works.

This is the sort of confusion that I think reigns over they who still cling to the Old Ways.

No, you can't always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.

Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).

You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be *explainable* in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.

So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
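As a rough check of both correspondences, the sketch below fits a one-parameter line y = w * x to a few made-up points. The data, the noise variance, and the prior variance are all assumptions chosen for illustration, and the grid search is only a crude stand-in for finding the posterior mode.

```python
def least_squares_slope(xs, ys):
    """Ordinary least squares for the model y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def ridge_slope(xs, ys, penalty):
    """L2-regularized least squares for the same model."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

def log_posterior(w, xs, ys, noise_var, prior_var=None):
    """Unnormalized log posterior of w; prior_var=None means a flat prior."""
    log_likelihood = -sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2 * noise_var)
    log_prior = 0.0 if prior_var is None else -w * w / (2 * prior_var)
    return log_likelihood + log_prior

def argmax_on_grid(f, lo, hi, steps=40000):
    """Return the grid point in [lo, hi] where f is largest."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(candidates, key=f)

xs = [1.0, 2.0, 3.0, 4.0]          # made-up inputs
ys = [2.1, 3.9, 6.2, 7.8]          # made-up noisy outputs
NOISE_VAR, PRIOR_VAR = 1.0, 0.5    # assumed variances

map_flat = argmax_on_grid(lambda w: log_posterior(w, xs, ys, NOISE_VAR), 0.0, 4.0)
map_gauss = argmax_on_grid(
    lambda w: log_posterior(w, xs, ys, NOISE_VAR, PRIOR_VAR), 0.0, 4.0)

ols = least_squares_slope(xs, ys)
ridge = ridge_slope(xs, ys, penalty=NOISE_VAR / PRIOR_VAR)
print(map_flat, ols)
print(map_gauss, ridge)
```

Under these assumptions the flat-prior mode lands on the least-squares slope, and the Gaussian-prior mode lands on the ridge slope with penalty equal to noise variance divided by prior variance.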

Sometimes you can't use Bayesian methods *literally;* often, indeed. But when you *can* use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a *better* answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.

When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you *can* directly use a calculation that mirrors the Bayesian law, you're *done* - like managing to put a Carnot heat engine into your car. It is, as the saying goes, "Bayes-optimal".

It seems to me that the toolboxers are looking at the sequence of cubes {1, 8, 27, 64, 125, ...} and pointing to the first differences {7, 19, 37, 61, ...} and saying "Look, life isn't always so neat - you've got to adapt to circumstances." And the Bayesians are pointing to the third differences, the underlying stable level {6, 6, 6, 6, 6, ...}. And the critics are saying, "What the heck are you talking about? It's 7, 19, 37 not 6, 6, 6. You are oversimplifying this messy problem; you are too attached to simplicity."
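The differences are easy to verify. A small sketch:

```python
def differences(seq):
    """Differences between consecutive terms of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]

cubes = [n**3 for n in range(1, 9)]   # 1, 8, 27, 64, 125, ...
first = differences(cubes)            # the messy-looking surface level
second = differences(first)
third = differences(second)           # the stable level underneath

print(first)
print(third)
```

Every entry of `third` is 6, however far the sequence of cubes is extended.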

It's not necessarily simple on a *surface* level. You have to dive deeper than that to find stability.

Think laws, not tools. Needing to calculate approximations to a law doesn't change the law. Planes are still atoms, they aren't governed by special exceptions in Nature for aerodynamic calculations. The approximation exists in the map, not in the territory. You can know the second law of thermodynamics, and yet apply yourself as an engineer to build an imperfect car engine. The second law does not cease to be applicable; your knowledge of that law, and of Carnot cycles, helps you get as close to the ideal efficiency as you can.

We aren't enchanted by Bayesian methods merely because they're beautiful. The beauty is a side effect. Bayesian *theorems* are elegant, coherent, optimal, and provably unique because they are *laws.*

**Addendum**: Cyan directs us to chapter 37 of MacKay's excellent statistics book, free online, for a more thorough explanation of the opening problem.

Jaynes, E. T. (1990.) Probability Theory as Logic. In: P. F. Fougere (Ed.), *Maximum Entropy and Bayesian Methods.* Kluwer Academic Publishers.

MacKay, D. (2003.) Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.

## 120 comments

Comments sorted by oldest first, as this post is from before comment nesting was available (around 2009-02-27).

## comment by Emil_Gilliam · 2008-01-14T07:44:10.000Z · LW(p) · GW(p)

"This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is made out of atoms" should read "... is *not* made out of atoms."

## comment by londenio · 2008-01-14T08:02:27.000Z · LW(p) · GW(p)

Eliezer,

I like your essays, but I feel that you are really beating a naive and unsophisticated frequentist straw man (straw person, politically correctly speaking). I think that the answer to the question "Should we draw different conclusions?" depends on some further assumptions about the process and about the type of conclusions we want to make. What kind of frequentist would think that his research is free of subjective assumptions? A naive one.

I admit that I am out of my depth and I would like to know more about Jaynes's example.

Replies from: Insert_Idionym_Here

## ↑ comment by Insert_Idionym_Here · 2011-11-06T17:30:01.258Z · LW(p) · GW(p)

I cannot speak for Eliezer, but I can speak from my experience. Because you are reading what appears to be only one side of an issue, you cannot get all the facts. Whatever he may write, he cannot write beyond the constraints of the information he currently possesses. If you want to have the whole picture, you need to talk to, and observe, everything, and everyone, not just this blog. So, perhaps Eliezer is beating a straw man. Go talk to some more people, gather some more information, and find out.

## comment by Doug_S. · 2008-01-14T08:33:49.000Z · LW(p) · GW(p)

To answer your story about data:

One person decides on a conclusion and then tries to write the most persuasive argument for that conclusion.

Another person begins to write an argument by considering evidence, analyzing it, and then comes to a conclusion based on the analysis.

Both of those people type up their arguments and put them in your mailbox. As it happens, both arguments happen to be identical.

Are you telling me the first person's argument carries the exact same weight as the second?

In other words, yes, the researcher's private thoughts do matter, because P(observation|researcher 1) != P(observation|researcher 2) even though the observations are the same.

I think that's the proper Bayesian objection, anyway.

Replies from: Kenny, Vanilla_cabs

## ↑ comment by Kenny · 2013-01-25T20:49:20.089Z · LW(p) · GW(p)

How can anyone other than the researchers themselves distinguish between them if their thoughts are private?

I understand "private thoughts" to imply that there are no other observable differences between the two researchers.

Replies from: homunq

## ↑ comment by homunq · 2013-10-30T08:48:11.685Z · LW(p) · GW(p)

But in the Jaynes example we're talking about, there are clear observable differences. One had announced that he would continue until he got a certain proportion of success, the other had announced that he would stop at 100.

The key is that Jaynes gives a further piece of data: that somehow we know that "Neither would stoop to falsifying the data". In Bayesian terms, this information, if reliable, screens out our knowledge that their plans had differed. But in real life, you're never 100% certain that "neither would stoop to falsifying the data", especially when there's often more wiggle room than you'd realize about exactly which data get counted how. In that sense, a rigorous pre-announced plan may be useful evidence about whether there's funny business going on. The reviled "frequentist" assumptions, then, can be expressed in Bayesian terms as a prior distribution that assumes that researchers cheat whenever the rules aren't clear. That's clearly over-pessimistic in many cases (though over-optimistic in others; some researchers cheat even when the rules ARE clear); but, like other heuristics of "significance", it has some value in developing a "scientific consensus" that doesn't need to be updated minute-by-minute.

In general: sure, the world is Bayesian. But that doesn't mean that frequentist math isn't math. Good frequentist statistics is better than bad Bayesian statistics any day, and anyone who shuts their ears or perks them up just based on a simplistic label is doing themselves a disservice.

## ↑ comment by Vanilla_cabs · 2019-07-10T21:22:21.938Z · LW(p) · GW(p)

But that's the thing: *P(observation|researcher 1) = P(observation|researcher 2)*

The individual patient results would not change whether it is researcher 1 or 2 leading the experiment. And given the 100 results that they had, both researchers would (and did) proceed exactly the same.

Replies from: Idan Arye

## ↑ comment by Idan Arye · 2020-09-07T11:43:50.990Z · LW(p) · GW(p)

Maybe the second researcher was one of 20 researchers using the same approach, and he is the only one with a 70% success rate - the other 19 had success rates of about 1%. We have never heard of these other researchers, because having failed to reach 60% they are researching to this very day and are likely to never publish their results. When you have 10,000 cures out of a million patients, it'd take a nearly impossible lucky streak to be able to get nearly a million and a half more successes without getting a billion more failures along the way, given the likely probability of 1% and assuming you are using the same cure and not optimizing it along the research (which will make it a different beast entirely)

So, if we combine all the tests of all 20 researchers together, the overall success rate comes out at roughly 1%. But the fact that only our one researcher has published cherry-picks a tiny fraction of that data to get a 70% success rate.

Compare to the first researcher, who would have published anyway testing 100 patients - so if there were 19 more like him who would get 1% success rate they would still publish, and a meta research could show more accurate results.

This is an actual problem with science publications - journals are more likely to publish successful results than null results, effectively cherry-picking the results from the successful researches.

Replies from: Vanilla_cabs

## ↑ comment by Vanilla_cabs · 2020-09-08T13:57:50.030Z · LW(p) · GW(p)

Maybe. But to assume any of that, you would need additional knowledge. In the real world, in an actual case, you might have checked that there are 19 other researchers who used the same approach and that they all hid their findings. Whatever that additional knowledge is that's allowing you to infer 19 hidden motivated researchers where only 1 is given, that is what gives you the ≈1% result.

Replies from: Idan Arye

## ↑ comment by Idan Arye · 2020-09-09T12:28:41.278Z · LW(p) · GW(p)

I'm not inferring 19 more motivated researchers - that was just an example (the number 20 was picked because the standard threshold for significance is 5%, which means one out of 20 researches that achieved this will be wrong). What I do infer is an *unknown* number of motivated researchers.

The key assumption here is that had the motivated researcher failed to meet the desired results, he would have kept researching without publishing and we would not know about his research. This implies that we do not know about any motivated researcher that failed to achieve their desired results - hence we can assume an unknown number of them.

The same cannot be said about the frugal researcher. If there were more frugal researchers but they all failed, they would have still published once they reached 100 patients and we would have still heard of them - so the fact we don't know about more frugal researchers really does mean there aren't any more frugal researchers.

Note that if my assumption is wrong, and in the other Everett branch where the motivated researcher failed we would have still known about his forever ongoing research, then in that case there really was no difference between them, because we could assign to the fact the motivated researcher is still researching the same meaning we assign to the frugal researcher publishing failed results.

-------

Consider a third researcher - one that's not as ethical as the first two, and plans on cherry-picking his results. But he decides he can be technically ethical if instead of cherry-picking the results inside each research he'd just cherry-pick the researches with desirable results. His plan is to research 100 patients, and if he can cure more than 60% of them he'll publish. Otherwise he'll just scrap that research's results and start a brand new research, with the same treatment but still technically a new research.

That third researcher is publishing results - it's 70 cures out of 100 patients. We know about his methods and we know about these results - and that's it. Should we just assume this is his only research and even though he intended to cherry-pick he happened to get these results on the first attempt, so we should treat them the same as we treat the frugal researcher's results?

Note that the difference between the motivated researcher and the cheating researcher is that the cheating researcher has to deliberately hide his previous researches (if there are any) while the motivated researcher simply doesn't know about his still researching peers (if there are any). But that's just a state of mind, and neither of them is lying about the research they did publish.

## comment by Ron_Hardin · 2008-01-14T09:02:54.000Z · LW(p) · GW(p)

95% confidence means that if you repeat the experiment you get the right answer 95% of the time.

That depends on your thoughts because what counts as a success comes up in the repeats.

The experiment itself does not tell you what would have counted as a success. It simply is. No confidence concept applies.

## comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-01-14T09:14:37.000Z · LW(p) · GW(p)

Emil, thanks, fixed.

Doug, your analogy is not valid because a biased reporting method has a different likelihood function to the possible prior states, compared to an unbiased one. In this case the single, fixed dataset that we see, has a different likelihood to the possible prior states, depending on the reporting method.

If a researcher who happens to be *thinking biased thoughts* carries out a fixed sequence of experimental actions, the resulting dataset we see does *not* have a different likelihood function to the possible prior states. All that a Bayesian needs to know is the experimental actions that were actually carried out and the data that was actually observed - not what the researcher was thinking at the time, or what other actions the researcher might have performed if things had gone differently, or what other dataset might then have been observed. We need only consider the actual experimental results.

Londenio, see Ron's comment - it's not a strawperson.

Replies from: Zian, None, Idan Arye

## ↑ comment by Zian · 2013-07-06T05:19:29.802Z · LW(p) · GW(p)

Great point but I worry that people will point to this post and say "See? Publication bias/questionable study design/corporate funding/varying peer review processes don't matter!"

In other words, it's good to strive for a fixed experimental process but reality is rarely that tidy.

## ↑ comment by **[deleted]** · 2013-12-14T17:08:24.375Z · LW(p) · GW(p)

Just a note here: the fact that a dataset has the same likelihood function regardless of the procedure that produced it is actually NOT a trivial statement - the way I see it, it is a somewhat deep result which follows from the optional stopping theorem and the fact that the likelihood function is bounded. Not trying to nitpick, just pointing out that there is something to think about here. According to my initial intuitions, this was actually rather surprising - I didn't expect experimental results constructed using biased data (in the sense of non-fixed stopping time) to end up yielding unbiased results, even with full disclosure of all data.

Replies from: Eliezer_Yudkowsky

## ↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-12-15T15:14:56.217Z · LW(p) · GW(p)

It's worth revising your intuitions if you found it surprising that a fixed physical act had the same likelihood to data regardless of researcher thoughts. It is indeed possible to see the mathematical result as "obvious at a glance".

Replies from: None, VAuroch

## ↑ comment by **[deleted]** · 2013-12-15T22:28:18.279Z · LW(p) · GW(p)

That's not quite what I meant. It is not the experimenter's thoughts that I am uncomfortable with- it is the collection of possible experimental outcomes.

I will try to illustrate with an example. Let us say that I toss a coin either (i) two times, or (ii) until it comes up heads. In the first case, the possible outcomes are HH, HT, TH, or TT; in the second case, they are H, TH, TTH, TTTH, TTTTH, etc. It isn't obvious to me that a TH outcome has the same meaning in both cases. If, for instance, we were not talking about likelihood and instead decided to measure something else, e.g. the portion of tosses landing on heads, this *wouldn't* be the case; in scenario (i), the expected portion of tosses landing on heads is 1/4 + .5/4 + .5/4 + 0/4 = .5, but in scenario (ii), it would be 1/2 + .5/4 + (1/3)/8 + .25/16 + ... = log(2); i.e. a little under .7.
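Both expectations in the example above can be checked directly. The sketch below enumerates scenario (i) and sums the series for scenario (ii), truncated at 60 terms on the assumption that the remaining tail is negligible; `log` here is the natural logarithm.

```python
from itertools import product
from math import log

def heads_fraction(sequence):
    """Portion of tosses in a sequence such as 'TH' that landed heads."""
    return sequence.count("H") / len(sequence)

# Scenario (i): exactly two tosses of a fair coin, four equally likely outcomes.
two_toss_outcomes = ["".join(tosses) for tosses in product("HT", repeat=2)]
expected_two_tosses = sum(heads_fraction(s) for s in two_toss_outcomes) / 4

# Scenario (ii): toss until the first head. The outcome with n tosses is
# (n - 1) tails followed by one head, has probability 1 / 2**n, and has
# heads fraction 1 / n.
expected_until_heads = sum((1 / 2**n) * (1 / n) for n in range(1, 61))

print(expected_two_tosses)
print(expected_until_heads, log(2))
```

Note that this compares an average over outcomes that could have happened; the probability of the particular sequence TH given the coin's bias is the same expression in both scenarios.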

## ↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-12-17T02:13:13.041Z · LW(p) · GW(p)

The TH outcome tells you the same thing *about the coin* because the coin *does not know what your plans were like*.

## ↑ comment by jazmt · 2013-12-17T02:54:18.407Z · LW(p) · GW(p)

Does the publication of the result tell you the same thing, since the fact that it was published is a result of the plans?

Replies from: None

## ↑ comment by **[deleted]** · 2013-12-18T19:40:55.356Z · LW(p) · GW(p)

I think in this case, we are assuming total and honest reporting of results (including publication); otherwise, we would be back to the story of filtered evidence. Therefore, the publication is not a result of the plans - it was going to happen in either case.

Replies from: jazmt

## ↑ comment by jazmt · 2013-12-19T01:45:52.241Z · LW(p) · GW(p)

Thanks, I understood the mathematical point but was wondering if there is any practical significance since it seems in the real world that we cannot make such an assumption, and that in the real world we should trust the results of the two researchers differently (since the one researcher likely published no matter what, whereas the second probably only published the experiments which came out favorably (even if he didn't publish false information)). What is the practical import of this idea? In the real world with all of people's biases shouldn't we distinguish between the two researchers as a general heuristic for good research standards?

(If this is addressed in a different post on this site feel free to point me there since I have not read the majority of the site)

## ↑ comment by VAuroch · 2015-08-30T20:21:21.740Z · LW(p) · GW(p)

You can claim that it should have the same likelihood either way, but you have to put the discrepancy *somewhere*.
Knowing the choice of stopping rule is evidence about the experimenter's state of knowledge about the efficacy. You can say that it should be treated as a separate piece of evidence, or that knowing about the stopping rule should change your prior, but if you don't bring it in somewhere, you're ignoring critical information.

## ↑ comment by Cyan · 2015-08-31T02:22:41.084Z · LW(p) · GW(p)

> you're ignoring critical information

No, in practical terms it's negligible. There's a reason that double-blind trials are the gold standard -- it's because doctors are as prone to cognitive biases as anyone else.

Let me put it this way: recently a pair of doctors looked at the available evidence and concluded (foolishly!) that putting fecal bacteria in the brains of brain cancer patients was such a promising experimental treatment that they did an end-run around the ethics review process -- and after leaving that job under a cloud, one of them was still considered a "star free agent". Well, perhaps so -- but I think this little episode illustrates very well that a doctor's unsupported opinion about the efficacy of his or her novel experimental treatment isn't worth the shit s/he wants to place inside your skull.

Replies from: EHeller, VAuroch

## ↑ comment by EHeller · 2015-08-31T05:41:06.952Z · LW(p) · GW(p)

Hold on - aren't you saying the choice of experimental rule is VERY important (i.e. double blind vs. not double blind, etc.)?

If so you are agreeing with VAuroch. You have to include the details of the experiment somewhere. The data does not speak for itself.

Replies from: Cyan

## ↑ comment by Cyan · 2015-08-31T17:28:39.940Z · LW(p) · GW(p)

Of course experimental design is very important in general. But VAuroch and I agree that when two designs give rise to the same likelihood function, the information that comes in *from the data* are equivalent. We disagree about the weight to give to the information that comes in from what the choice of experimental design tells us about the *experimenter's prior state of knowledge*.

## ↑ comment by VAuroch · 2015-09-02T00:27:40.498Z · LW(p) · GW(p)

Double-blind trials aren't the gold standard, they're the best available standard. They still don't replicate far too often, because they don't remove bias (and I'm not just referring to publication bias). Which is why, when considering how to interpret a study, you look at the history of what scientific positions the experimenter has supported in the past, and then update away from that to compensate for bias which you have good reason to think will show up in their data.

In the example, past results suggest that, even if the trial was double-blind, someone who is committed to achieving a good result for the treatment will get more favorable data than some other experimenter with no involvement.

And that's on top of the trivial fact that someone with an interest in getting a successful trial is more likely to use a directionally-slanted stopping rule if they have doubts about the efficacy than if they are confident it will work, which is not explicitly relevant in Eliezer's example.

Replies from: Cyan

## ↑ comment by Idan Arye · 2020-09-07T20:56:49.679Z · LW(p) · GW(p)

I think I figured out where the source of confusion is. From the wording of the problem I assume that:

- The first researcher is going to publish anyways once he reaches 100 patients, no matter what the results are.
- The second researcher will continue as long as he doesn't meet his desired ratio, and had he not reached these results - he would continue forever without publishing and we'd never even have heard of his experiment.

For the first researcher, a failure would update our belief in the treatment's effectiveness downward and a success would update it upward. For the second researcher, a failure will not update our belief - because we wouldn't even know the research existed - so for a success to update our belief upward would violate the Conservation of Expected Evidence.

**But** - if we do know about the second researcher's experiment, we can interpret the fact that he didn't publish as a failure to reach a sufficient ratio of success, and update our belief down - which makes it meaningful to update our belief up when he publishes the results.

So - it's not about *state of mind* - it's about the researchers' *actions* in other Everett branches where their experiments failed.

## comment by Elver · 2008-01-14T10:12:43.000Z · LW(p) · GW(p)

Something popped into my mind while I was reading about the example in the very beginning. What about research that goes out to prove one thing, but discovers something else?

Group of scientists want to see if there's a link between the consumption of Coca-Cola and stomach cancer. They put together a huge questionnaire full of dozens of questions and have 1000 people fill it out. Looking at the data they discover that there is no correlation between Coca-Cola drinking and stomach cancer, but there is a correlation between excessive sneezing and having large ears.

So now we have a group of scientists who set out to test correlation A, but found correlation B in the data instead. Should they publish a paper about correlation B?

Replies from: Baruta07, alex_zag_al

## ↑ comment by Baruta07 · 2012-10-29T16:25:09.871Z · LW(p) · GW(p)

Before they publish anything (other than an article on Coca-Cola not being related to stomach cancer), they should first use a different test group to determine that the first result wasn't a sampling fluke or otherwise biased. (Perhaps sneezing wasn't causing large ears after all, or large ears were correlated with something that also caused sneezing.)

What brought the probability to your attention in the first place shouldn't be what proves it.

"If A then B" is a separate experiment from "If C then D" and should require separate additional proof.

Replies from: Kindly

## ↑ comment by Kindly · 2012-10-29T20:30:14.263Z · LW(p) · GW(p)

That's a useful *heuristic* to combat our tendency to see patterns that aren't there. It's not strictly necessary.

Another way to solve the same problem is to look at the first 500 questionnaires first. The scientists then notice that there is a correlation between excessive sneezing and large ears. Now the scientists look at the last 500 questionnaires -- an independent experiment. If these questionnaires also show correlation, that is also evidence for the hypothesis, although it's necessarily weaker than if another 1000-person poll were conducted.

So this shows that a second experiment isn't necessary if we think ahead. Now the question is, if we've already foolishly looked at all 1000 results, is there any way to recover?

It turns out that what can save us is math. There's a bunch of standard tests for significance when lots of variables are compared. But the basic idea is the following: we can test if the correlation between sneezing and ears is high, by computing our prior for what sort of correlation the two most closely correlated variables would show.

Note that although our prior for two arbitrary variables might be centered at 0 correlation, our prior for two variables that are selected by choosing the highest correlation should be centered at some positive value. In other words: even if the questions were all about unrelated things, we expect a certain amount of correlation between some things to happen by chance. But we can figure out how much correlation to expect from this phenomenon! And by doing some math, we might be able to show that the correlation between sneezing and having large ears is too high to be explained in this way.
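One rough way to "figure out how much correlation to expect" is to simulate the no-relationship case directly. The sketch below is an illustration under assumptions that are not in the comment: 20 questions, answers modeled as independent Gaussian noise, and 10 repetitions. It records the strongest correlation found among all question pairs when nothing is actually related to anything.

```python
import random

def standardize(xs):
    """Center a list and scale it to unit length, so that the dot product
    of two standardized lists equals their Pearson correlation."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    norm = sum(c * c for c in centered) ** 0.5
    return [c / norm for c in centered]

def max_abs_correlation(n_people, n_questions, rng):
    """Largest |correlation| over all question pairs when every answer
    is independent noise (hypothetical stand-in for unrelated questions)."""
    cols = [standardize([rng.gauss(0.0, 1.0) for _ in range(n_people)])
            for _ in range(n_questions)]
    return max(abs(sum(a * b for a, b in zip(cols[i], cols[j])))
               for i in range(n_questions)
               for j in range(i + 1, n_questions))

rng = random.Random(0)  # fixed seed so the run is repeatable
null_maxima = sorted(max_abs_correlation(1000, 20, rng) for _ in range(10))
print("smallest 'best' spurious correlation:", null_maxima[0])
print("largest 'best' spurious correlation: ", null_maxima[-1])
```

An observed sneezing/ears correlation would then be compared against the spread of `null_maxima` rather than against zero. Real questionnaire answers are not Gaussian and not independent, so this only indicates the shape of the calculation.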

Replies from: Baruta07, Houshalter

## ↑ comment by Houshalter · 2013-06-16T07:06:05.349Z · LW(p) · GW(p)

There is other information to consider, though. If there really were a correlation, it's likely others would have noticed it in their studies. The fact that you haven't heard of it before suggests a lower prior probability.

Eventually someone just by chance will stumble upon seeming correlations that aren't really there. If you only publish when you find a correlation but not when you don't, then publication bias is created.

## ↑ comment by alex_zag_al · 2012-12-10T15:46:48.321Z · LW(p) · GW(p)

I have no idea about what's done in actual statistical practice, but it seems to make sense to do this:

Publish the likelihood ratio for each correlation. The likelihood ratio for the correlation being real and replicable will be very high.

Since they bothered to do the test, you can figure that people in the know have decently sized prior odds for the association being real and replicable. There must have been animal studies or a biochemical argument or something. Consequently, a high likelihood ratio for this hypothesis may have been enough to convince them - that is, when it's multiplied with the prior, the resulting posterior may have been high enough to represent the "I'm convinced" state of knowledge.

But the prior odds for the correlation being real and replicable are the same tiny prior odds you would have for any equally unsupported correlation. When they combine the likelihood ratio with their prior odds, they do end up with much higher posterior odds for it than they do for other arbitrary-seeming correlations. But still insignificant.

The critical thing that distinguishes the two hypotheses is *whatever previous evidence led them to attempt the test*; that's why the prior for the association is higher. It's subjective only in the sense that it depends on what you've already seen - it doesn't depend on your thoughts. Whereas, in what Kindly says is the standard solution, you apply a different test depending upon what the researcher's intentions were.

(I have no idea how you would calculate the prior odds. I mean, Solomonoff induction with your previous observations is the Carnot engine for doing it, but I have no idea how you would actually do it in practice)
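A worked version of this argument in odds form may help. The numbers are purely illustrative assumptions, not estimates from any study: suppose the data give the same likelihood ratio of 100 both to a hypothesis that was motivated beforehand (prior odds 1:10) and to a correlation nobody had reason to expect (prior odds 1:10,000).

$$\text{posterior odds} = \text{prior odds} \times \text{likelihood ratio}$$

$$\frac{1}{10} \times 100 = 10 \;\;(\text{about } 91\%), \qquad \frac{1}{10\,000} \times 100 = \frac{1}{100} \;\;(\text{about } 1\%)$$

The same evidence moves both hypotheses by the same factor, and the difference in where they end up comes entirely from the prior.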

## comment by Leo_G. · 2008-01-14T10:24:04.000Z · LW(p) · GW(p)

Doug S., I agree in principle, but disagree on your particular example because it is not statistical in nature. Should we not be hugging the query "Is the argument sound?" If a random monkey typed up a third identical argument and put it in the envelope, it's just as true. The difference between this and a medical trial is that we have an independent means to verify the truth. Argument screens off Methodology...

If evidence is collected in violation of the fourth amendment rights of the accused, it's inadmissible in court, yes, but that doesn't mean that, ceteris paribus, the prosecution KNOWS LESS than if it were obtained legally.

So, when do I start agreeing with you? Here: The problem lies in the fact that the two trial methodologies create different sorts of Everett branches. The fact that the methodologies differed is ITSELF a piece of evidence which the esteemed Mr. Yudkowsky doesn't appear to have room for in this Bayesian analysis. I agree that the relevant post appears to be What Evidence Filtered Evidence?

Replies from: Kenny

## ↑ comment by Kenny · 2013-01-25T22:34:47.639Z · LW(p) · GW(p)

If there is evidence of the researcher's private thoughts, they aren't private.

In the hypothetical situation, an outside observer wouldn't know that the methodologies are different.

You're right to suspect that there *probably would* be evidence that the methodologies differed in a realistic scenario.

## comment by John5 · 2008-01-14T10:24:13.000Z · LW(p) · GW(p)

Hi, I'm from Melbourne. All I can say is beware of any organisation or group of people who claim to be committed to overcoming bias. Isn't that what all the "true believers" in his-story imagined that they were doing?

Heisenberg and others told us that the very act of observing anything includes all sorts of unconscious biases which affect both the state of what is being observed and, via feedback, the state of the presumed observer, and of course the results of any and every research project, no matter what it is. A person stuffed with hamburgers obviously perceives everything differently than a Yogi who eats a purifying vegetarian diet, does pranayama (breathing) exercises, and hatha yoga too. The states of their nervous systems would be completely different, and hence their perceptions too. ALL perceptions are electronic fabrications and projections of the brain nervous system complex. We never see the world--only our brain created electronic projections---always, all of the time.

ALL thinking involves a bias---no exceptions. Why are you always thinking anyway?

All of that is inevitable when a limited point of view is brought to any act of observation. What is behind your head? What does the room look like from the point of view of your toe nail or from the other end of the room? What does the room look like from every possible point of view within the room? And from every possible point of view in space and time? Where is the room? Where, and who, and what, are you?

How can you explain the authentic world view communicated in the dot paintings of the best Australian so called "aboriginal" artists? It is radically different to our western way of seeing, and yet completely coherent to an "aboriginal", and an integral part/expression of their culture altogether.

Perhaps it is far more real and ALIVE than our impoverished and power seeking objectification power/knowledge games which effectively reduce everything to dead matter---even if Bayesian methods are used.

The search for scientific "knowledge" is really hunter-gatherer behaviour which originates from the very primitive pre-civilized sections of the brain---the reptilian brain.

A quote from Carlos Castaneda.

"In European membership the "world" is built largely from what the eyes report to the mind. In sorcery the total body is used as a receptor. As Europeans we see a "world" out "there" and talk to ourselves about it. We are "here" and the "world" is "there". Our eyes feed our "reason" and we have no direct knowledge of things. According to sorcery this burden on the eyes is unnecessary. We know with the total body."

The only way bias could be truly overcome would be to see the world from the Acausal perspective of God's eyes---the room and the universe in its totality before it is cut up, "objectified", and thus controlled. The process of objectifying any and everything is about gaining power and control----the Western gaze! This is what Western "culture" is all about. Power and control.

## comment by Leo_G. · 2008-01-14T10:34:30.000Z · LW(p) · GW(p)

Whoops, looks like I may have shot myself in the foot. The same way argument screens off authority, the actual experiment that was run screens off the intentions of the researcher.

Efficacy of the drug -> Results of the experiment <- Bias of the researcher

Efficacy, Bias -> Results of the experiment -> Our analysis of the efficacy of the drug

## comment by Ben_Jones · 2008-01-14T12:29:32.000Z · LW(p) · GW(p)

Leo: *"...the actual experiment that was run screens off the intentions of the researcher."*

As long as the validity and reliability of the experiment itself aren't affected by the bias, then the findings are your territory. Analysis is the creation of the map, where all sorts of things can go awry.

Data may be confusing, or even misleading, but this is a fact about us, not the data. Data acquired from valid experiment does not lie, whatever your motives. It might just be telling you something you're not listening to.

*"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'."* - Isaac Asimov

Slightly off-topic from the blog, apologies.

## comment by Tiedemies2 · 2008-01-14T12:55:36.000Z · LW(p) · GW(p)

I am sorry that I am too lazy to read this thoroughly, but to me the original problem seems a mere illusion and a strawman. A priori, the two experiments are different, but who cares? The experiment with its stopping condition yields a distribution of results only if you have some assumed a priori distribution over the patient population. If you change the stopping condition without changing this distribution, you change the experiment and you get a different distribution for the result. This has nothing to do with evidential impact. Frequentists don't, as far as I can tell, claim anything like that.

## comment by Nick_Tarleton · 2008-01-14T13:36:22.000Z · LW(p) · GW(p)

Are P(r>70|effective) and P(r>70|~effective) really the same in those two experiments? Trivially, at least, in the second one P(r<60)=0, unlike in the first, so the distribution of r over successive runs must be different. The sequences of experimental outcomes happened to be the same in this case, but not in the counterfactual case where fewer than 60 of the first 100 patients were cured, and it seems that in fact that would affect the likelihood ratio. (I may run a simulation when I have the time.)

## comment by Nick_Tarleton · 2008-01-14T13:41:09.000Z · LW(p) · GW(p)

Oh, wait: assuming the second researcher stops as soon as (r >= 60) AND (N >= 100) (the latter expression to explain that they kept going until r=70), then the distribution above 60 will actually not be any different (all the probability mass that was in r100, well, only the second experimenter could possibly have generated that result.

## comment by Caledonian2 · 2008-01-14T14:12:28.000Z · LW(p) · GW(p)

"Are you telling me the first person's argument carries the exact same weight as the second?"

Yes. It's the arguments that matter.

Now, if we know that one person was trying to support a thesis and the other presenting the data and drawing a conclusion, we can weight them differently, if we only have access to one. The first case might leave out contrary data and alternative hypotheses in an attempt to make the thesis look better. We expect the second case to mention all relevant data and the obvious alternatives, if only briefly, so the absence of contrary data is evidence of its nonexistence in that case.

Since we have both, we can exclude the possibility that the first author left out data to make his case look better. Thus, the two arguments are equally valid.

## comment by Richard_Hollerith2 · 2008-01-14T16:28:46.000Z · LW(p) · GW(p)

It is not completely unreasonable to believe that the big problem in medical research is not a lack of data or a lack of efficient statistical procedures to translate the data into conclusions, but rather the domination of the process by clever arguers. The old-fashioned procedure seems to penalize clever arguers. Although it is of course regrettable that the penalization is an accidental side effect of a misunderstanding of math, the old-fashioned procedure might in fact work better than a procedure whose sole objective is to squeeze as much knowledge from the data as possible.

## comment by Jef_Allbright · 2008-01-14T16:33:11.000Z · LW(p) · GW(p)

A good post on a profoundly beautiful subject, and a nice bit of jujutsu the way it works against the backdrop of the meta-commentary.

A minor quibble: Have you considered that use of "Law", like "The Way", while perhaps appropriately elevating, might work against your message by obscuring appreciation of increasingly general "principles"?

## comment by steven · 2008-01-14T16:58:42.000Z · LW(p) · GW(p)

For some good mockery of orthodox statistical concepts I recommend the writings of Guernsey McPearson.

## comment by Public_Heretic · 2008-01-14T17:29:40.000Z · LW(p) · GW(p)

Eliezer says:

"We aren't enchanted by Bayesian methods merely because they're beautiful. The beauty is a side effect. Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws."

This seems deeply mistaken. Why should we believe that Bayesian formulations are any more inherently "lawlike" than frequentist formulations? Both derive their theorems from within strict formal systems which begin with unchanging first principles. The fundamental difference between Bayesians and Frequentists seems to stem from different ontological assumptions about the nature of a probability distribution (Frequentists imagine a distribution as a set of possible outcomes which *would* have occurred under different realizations of our world, whereas Bayesians imagine a distribution as a description of a single subjective mental state regarding a single objective world about which we are uncertain).

Moreover, doesn't Cox's Theorem imply that at a sufficient level of abstraction, any Bayesian derivation could (at least in principle) be creatively re-framed as a Frequentist derivation, since both must map (at some level) onto the basic rules of probability? It seems to me that, as far as the pure "math" is concerned, both frameworks have equal claim to "lawlike" status.

It therefore seems that what drives Eliezer (and many others, myself included) towards Bayesian formulations is a type of (dare I say it?) *bias* towards a certain kind of beauty which he has cleverly re-labeled as "law."

## ↑ comment by Дмитрий Зеленский (dmitrii-zelenskii) · 2019-08-16T17:52:39.380Z · LW(p) · GW(p)

If Bayesian derivation is a frequentist derivation, it does not mean that any frequentist derivation is necessarily equivalent to Bayesian. Mr. Yudkowsky claims, more or less, that Bayesian derivation is equivalent to the ideal frequentist derivation.

## comment by Norman_Hall · 2008-01-14T18:11:43.000Z · LW(p) · GW(p)

I am by no means an expert in statistics, but I do appreciate Eliezer Yudkowsky's essay, and think I get his point that, given only experiment A and experiment B, as reported, there may be no reason to treat them differently IF WE DON'T KNOW of the difference in protocol (if those thoughts are truly private). But it does seem rather obvious that, if there were a number of independent experiments with protocol A and B, and we were attempting to do a meta-analysis to combine the results of all such experiments, there would be quite a number of experiments where n would be greater than 100 (from protocol B). With the protocol as stated, these would all end when cures were greater than but very close to 60%. If we assume that the "real" cure rate in the population is close to 70%, then, unless some Bayesian term is introduced to account for the bias in methodology, the meta-analysis would seem to be biased toward the incorrect conclusion that the lower 60% figure was closer to reality. Presumably, that kind of bias would be noticed in the experiments with n > 100, and could not have been kept as a private thought with a large number of repeat experiments.

I am not sure, but I would think that, if Bayesian analysis is (or can be) as rigorous as it is claimed, then even the analysis of the original pair might be expected to include some terms that would reflect that potential bias due to a difference in protocol IF THAT DIFFERENCE IS KNOWN to the Bayesian statistician doing the analysis. I find it disturbing that experiment B could have come out to have n = 100 and cure rate = 60%, or n = 1000 and cure rate = 60%, but not with cure rate = 59%, no matter how large n might have become.

## comment by Benquo · 2008-01-14T19:05:00.000Z · LW(p) · GW(p)

I second conchis's request. Shouldn't the second method cut against assumption of a randomized sample?

I'm also thinking of an analogy to the problem of only reporting studies that demonstrate the effectiveness of a drug, even if each of those studies on its own is fair. It seems to me as if stopping when and only when one gets the results one wants is similarly problematic, once again even if everything else about the experiment is strictly ok; outcomes that show 60%+ effectiveness are favored under that method, so P(real effectiveness!=60%|experimental effectiveness=60%) should be increased. Now, I understand why the effect here would be much smaller than in the case of simply leaving out inconvenient data, but I don't understand why I should think that it is exactly equal to zero.

## comment by Unknown · 2008-01-14T19:23:40.000Z · LW(p) · GW(p)

Conchis and Benquo: Eliezer's response to Doug was that the probability of a favorable argument is greater, given a clever arguer, than the prior probability of a favorable argument. But the probability of a 60% effectiveness given 100 trials, given an experimenter who intended to keep going until he had a 60% effectiveness, is no greater than the prior probability of a 60% effectiveness given 100 trials. This should be obvious, and does distinguish the case of the biased intentions from the case of the clever arguer.

## comment by Unknown · 2008-01-14T19:27:14.000Z · LW(p) · GW(p)

To make that claim more obvious: suppose I am involved in the argument between the Greens and the Blues, and after seeing the blue sky, I intend to keep looking up at the sky until it looks green. This won't make it any more probable that when I look at the sky tomorrow, it will look green. This probability is determined by objective facts, not by my intentions, and likewise with the probability of getting a 60% effectiveness from 100 trials.

## comment by tcpkac · 2008-01-14T19:33:17.000Z · LW(p) · GW(p)

I just saw an incredibly beautiful sunset. I also see the beauty in some of EY's stuff. Does that mean the sunset was Bayesian, or indeed subject to underlying lawfulness? No, it only means my enhanced primate brain has a tendency to see beauty in certain things. Not that there is any more epistemic significance in a sunset than there is in a theorem.

## comment by Dave_D · 2008-01-14T21:10:53.000Z · LW(p) · GW(p)

I admit that I am still not quite sure what a "Bayesian" is as opposed to an "old style" statistician (though I am very familiar with Bayes' theorem, prior probabilities, likelihood ratios, etc).

That being said, the example at the beginning of the post is a great example of "after the fact" reasoning. If researcher #2 had required 1,000 trials, then you could say that our interpretation of his results is the same as that of, say, "researcher #3", who set out to run 1,000 trials no matter how many cures were observed. Since (I would imagine) we would all agree that the conclusion from researcher #3's results is the same as (if not stronger than) the one from researcher #1's, one must then conclude that the interpretation of researcher #1's results is the same as that of researcher #2's, REGARDLESS of how many trials it takes researcher #2 to get his desired results. And (again, I *think*) we can all agree that just isn't the case.

I would also like to second Elver's comment above; it is another example of "after the fact" reasoning.

Sort of like shooting, and declaring whatever you hit to be your target.

## comment by Psy-Kosh · 2008-01-14T21:11:02.000Z · LW(p) · GW(p)

Had to actually think about it a bit, and I *think* it comes down to this:

The thing that determines the strength of evidence in favor of some hypothesis vs another is "what's the likelihood we would have seen E if H were true vs what's the likelihood we would have seen E if H were false"

Now, experimenter B is not at all filtering based on H being true or false, but merely on the properties of E.

So the fact of the experimenter presenting the evidence E to us can only (directly) potentially give us additional information on the properties of the total evidence E that was collected, rather than (directly) telling us anything about H.

But... the "filtering" rule the experimenter uses is only when to stop experimenting. In other words, once the experimenter does present data E, we know that E is all the evidence there is that he collected. In other words, this isn't filtered evidence in the sense of the experimenter throwing away data he or she doesn't like because once we are given E, there's nothing more to know.

Let me clarify that: Imagine you didn't know the difference in the second experimenter's protocol, you had thought they were the same. Then later you learn the difference. Have you actually learned anything new? Is there any new info about E that you have that you didn't already believe you had?

In this case, no, because unlike filtered evidence situations, the information about the experimenter's intent has no effect on what other possible evidence there may have been that was hidden from you. The probability of you seeing this *specific* evidence, this specific chunk of data from experimenter A is the same as that from experimenter B, given either effectiveness or non-effectiveness.

There're *other* patterns of data that one would expect to be possible to see from A but not from B, and other patterns that one would expect to possibly see from B but not from A, but these specific data sets being published have probability completely independent of which experimenter was doing it, right?
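The claim in the last two paragraphs can be written out. As a sketch, assume each patient is cured independently with probability $p$ (an assumption the comment leaves implicit), and let $E$ be the exact published sequence of 70 cures and 30 failures, which must be one that experimenter B's rule could also have produced. Then, whichever experimenter ran the trial:

$$P(E \mid p, \text{A}) = P(E \mid p, \text{B}) = p^{70}(1-p)^{30}$$

so for any two candidate cure rates $p_1$ and $p_0$ the likelihood ratio is the same under both protocols:

$$\frac{P(E \mid p_1)}{P(E \mid p_0)} = \left(\frac{p_1}{p_0}\right)^{70} \left(\frac{1-p_1}{1-p_0}\right)^{30}$$

The stopping rule changes which sequences are possible at all, but it does not change the probability of this particular one.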

## comment by Cyan2 · 2008-01-14T23:58:55.000Z · LW(p) · GW(p)

There are some rather baroque kinds of prior information which would require a Bayesian to try to model the researcher's thought processes. They pretty much rely on the researcher having more information about the treatment effectiveness than is available to the Bayesian, and that the stopping rule depends on that extra information. This idea could probably be expressed more elegantly as a presence or absence of an edge in a Bayesian network, twiddling the d-separation of the stop-decision node with the treatment effectiveness node.

## comment by Fly2 · 2008-01-15T00:15:02.000Z · LW(p) · GW(p)

"So now we have a group of scientists who set out to test correlation A, but found correlation B in the data instead. Should they publish a paper about correlation B?"

Since you are testing multiple hypotheses simultaneously, it is not comparable to Eliezer's example. Still, it is an interesting question...

Sure. The more papers you publish the better. If you are lucky the correlation may hold in other test populations and you've staked your claim on the discovery. Success is largely based on who gets credit.

Should a magazine publish papers reporting correlations with relatively high P-values? When thousands of scientists are data mining for genetic correlations to disease, chance correlations will be very common. If the genetic difference occurred in a metabolic pathway known to be relevant to the disease, the correlation might be publishable even with a high P-value. If the scientists just reported a random correlation they should have a low P-value.

A better approach might be to replace publication in a journal by some other mechanism. Suppose there were an online, centralized database for hypotheses related to a disease or trait. No single population study would be meaningful, but multiple reports by different researchers in different populations would be significant. Evidence would accumulate and credit would be shared among all those responsible for validating or disproving the hypothesis.

## comment by Benquo · 2008-01-15T02:09:31.000Z · LW(p) · GW(p)

Unknown, I still find it difficult to accept that there should be *literally zero* modification. It's important not just that n=100, but that n=100 *random* trials. Suppose both researchers reported 100% effectiveness with the same n, but researcher 2 threw out all the data points that suggested ineffectiveness? You still have an n=100 and a 100% effectiveness among that set, but any probability judgment that doesn't account for the selective method of picking the population is inadequate. I would suggest that either less or a different kind of information is transmitted when the magnitude of n (i.e. the sample size) is not a precondition, and is crafted to enhance a certain outcome. It's a subtler form of "throwing out the data" that doesn't agree with you.

Why isn't this just a much subtler version of the Monty Hall problem? It matters *when* and *how* the sample size was selected.

To put it in more concrete terms: if you assume, by hypothesis, that as n increases, the effectiveness percentage will fluctuate randomly due to statistical clustering, in the range between 54 and 61, then there is only an **infinitesimal** chance that, given truly infinite resources, researcher 2 from the original problem will **not** come up with the 60% figure, but the chance that researcher 1 will come up with the same figure is significantly less than 100%. If researcher 2 performs his experiment a large enough number of times, the distribution of his "n"s will be described by some kind of probability curve. And you will get 100% confirmation of the 60% figure. Whereas if you assign researcher 1 *a priori* the same distribution pattern of "n"s, his reported effectiveness percentage will be something like 57.5%. And you could come up with an infinite amount of evidence either way.

Now, of course, in real life the magnitude of n has some upper bound, so researcher 2 is playing a kind of martingale game with the facts, and I don't know enough math to be sure that doesn't precisely cancel out the difference, on average, between his data and that of researcher 1. But then again real life isn't a thought experiment (at least not one we're aware of), and researcher 2 is unlikely to abstain from some other subtle unconscious skewing of the data.
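The frequency side of this comment can be checked with a small simulation. This is only a sketch under assumptions not taken from the thread: a fixed true cure rate of 57.5% (rather than a fluctuating one), a cap of 1,000 patients standing in for "some upper bound", and a reading of researcher 2's rule as "stop at the first n ≥ 100 where the running cure rate is at least 60%". Both researchers see the same simulated patient stream in each run.

```python
import random

def simulate(p_true, n_fixed, num, den, n_cap, runs, seed=0):
    """Estimate how often each researcher ends up reporting a cure rate
    of at least num/den.

    Researcher 1 checks once, at n_fixed patients.
    Researcher 2 checks at every n from n_fixed up to n_cap and stops
    the first time the running rate reaches num/den.
    """
    rng = random.Random(seed)
    fixed_hits = seq_hits = 0
    for _ in range(runs):
        cures = 0
        for n in range(1, n_cap + 1):
            cures += rng.random() < p_true  # True counts as 1
            # integer comparison avoids floating-point trouble at the threshold
            if n >= n_fixed and cures * den >= num * n:
                seq_hits += 1
                fixed_hits += (n == n_fixed)  # researcher 1 only sees n_fixed
                break
    return fixed_hits / runs, seq_hits / runs

fixed_frac, seq_frac = simulate(p_true=0.575, n_fixed=100, num=3, den=5,
                                n_cap=1000, runs=2000)
print("researcher 1 reports >= 60%:", fixed_frac)
print("researcher 2 reports >= 60%:", seq_frac)
```

This shows only how often each protocol yields a publishable 60% figure. It does not by itself say how much a given published data set should move our beliefs, which is the point the rest of the thread disputes.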

## comment by Caledonian2 · 2008-01-15T02:56:27.000Z · LW(p) · GW(p)

"Suppose both researchers reported 100% effectiveness with the same n, but researcher 2 threw out all the data points that suggested ineffectiveness?"

Yes, but that is not the case that was described to us.

The mindset of the researchers doesn't matter - only the procedure they follow does. And as unlikely as it may be, in the examples we're provided, the second researcher does not violate the proper procedure.

## comment by Paul_Gowder · 2008-01-15T02:56:47.000Z · LW(p) · GW(p)

I have to say, the reason the example is convincing is because of its artificiality. I don't know many old-school frequentists (though I suppose I'm a frequentist myself, at least so far as I'm still really nervous about the whole priors business -- but not quite so hard as all that), but I doubt that, presented with a stark case like the one above, they'd say the results would come out differently. For one thing, how would the math change?

But the case would never come up -- that's the thing. It's empty counterfactual analysis. Nobody who is following a stopping rule as ridiculous as the one offered would be able to otherwise conduct the research properly. I mean, seriously. I think Benquo nailed it: the second researcher's stopping rule ought to rather severely change our subjective probability in his/her having used a random sample, or for that matter not committed any number of other research sins, perhaps unconsciously. And that in turn should make us less confident about the results.

## comment by Paul_Gowder · 2008-01-15T03:06:20.000Z · LW(p) · GW(p)

Uh, strike the "how would the math change?" question -- I just read the relevant portion of Jaynes's paper, which gives a plausible answer to that. Still, I deny that an actual practicing frequentist would follow his logic and treat n as the random variable.

(ALSO: another dose of unreality in the scenario: what experimenter who decided to play it like that would *ever* reveal the quirky methodology?)

## comment by Psy-Kosh · 2008-01-15T03:17:34.000Z · LW(p) · GW(p)

Maybe in private notes?

And as far as thinking that N was the random variable in the second case, I'd thought it through and basically concluded that, since no data at all was being hidden from us by experimenter B, and since A and B followed the same procedure, the probability that this specific outcome would be published by B was the same as for A.

Now, there is a partial caveat. One might say "but... What if B's rule was to only publish the moment he had at least 70%?"

So one might think there are more possible ways it could have come out like that for A than for B, so that may affect things.

But no, and there's two ways to look at it...

One is if you ask "what is the *specific* data A has?", i.e., not just cures out of total trials, but the specific outcome of each patient; the likelihoods would then correspond to one specific outcome. And since the trials are (assumed) independent of each other given that the effectiveness of the medication is held fixed, rearranging the patients doesn't alter the likelihood. So they can be rearranged to match B's data without changing anything, even if we know B's data must have the property that the 100th trial was the first one to get to 70% or above.

So, what happens here.. you know that no matter what answer you get to the question "what, precisely, was A's data", the produced likelihoods would then end up being identical to the likelihood produced from B's data.

I.e., by the principle that the expectation of future belief has to equal present belief, you've gained the same knowledge from both.

Another way to look at it is to consider that the only difference between the two, under these assumptions, would be a combinatorial coefficient. And that would be independent of the effectiveness itself given the numbers of how many cures vs non cures.

So that'll cancel out when computing the likelihood ratios.
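The cancellation can be written out explicitly. This is only a sketch: θ₁ and θ₂ stand for any two candidate cure probabilities, and c_A and c_B are labels introduced here (not in the original comment) for the number of patient-by-patient outcome sequences that each researcher's procedure allows to end at n = 100, r = 70. For researcher A, c_A is the binomial coefficient; for B it is some smaller count, but in both cases it does not depend on the effectiveness.

$$
\frac{P_A(\text{data}\mid\theta_1)}{P_A(\text{data}\mid\theta_2)}
= \frac{c_A\,\theta_1^{70}(1-\theta_1)^{30}}{c_A\,\theta_2^{70}(1-\theta_2)^{30}}
= \frac{\theta_1^{70}(1-\theta_1)^{30}}{\theta_2^{70}(1-\theta_2)^{30}}
= \frac{c_B\,\theta_1^{70}(1-\theta_1)^{30}}{c_B\,\theta_2^{70}(1-\theta_2)^{30}}
= \frac{P_B(\text{data}\mid\theta_1)}{P_B(\text{data}\mid\theta_2)},
\qquad c_A = \binom{100}{70},\quad 0 < c_B \le c_A .
$$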

## comment by Cyan2 · 2008-01-15T03:41:45.000Z · LW(p) · GW(p)

Paul Gowder,

You've read Jaynes -- now read MacKay.

"Information Theory, Inference, and Learning Algorithms" (available for download here).

The key portions are sections 37.2 - 37.3 (pp 462-465).

Replies from: None

## ↑ comment by **[deleted]** · 2021-01-28T10:17:05.046Z · LW(p) · GW(p)

No-one's going to read this reply, as I'm 13 years too late, but -- oh dear me -- MacKay makes hard work of the medical example.

A straightforward Frequentist solution follows. There are 4 positive cases among the 30 subjects in class A and 10 subjects in class B. We'll believe that treatment A is better if it's very unlikely that, were the treatments identical, there would be so relatively few positive cases among the A's. There are 91390 ways of picking 4 from 40, with only 3810 ~= 4.17% having 0 or 1 in class A. So, unless we were unlucky (4.17% chance) with the data, we can conclude that treatment A is better.
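The counting in this argument is easy to check. The following is a minimal sketch using only the numbers given in the comment (30 subjects in class A, 10 in class B, 4 positive cases); the variable names are made up for illustration.

```python
from math import comb

# Numbers taken from the comment: 30 subjects in class A, 10 in class B,
# and 4 positive cases in total.
n_a, n_b, positives = 30, 10, 4

# Ways to place the 4 positive cases among all 40 subjects.
total = comb(n_a + n_b, positives)

# Ways in which at most one of the positive cases lands in class A
# (k = 0 or k = 1 positives in A, the rest in B).
extreme = sum(comb(n_a, k) * comb(n_b, positives - k) for k in range(0, 2))

tail_probability = extreme / total
print(total, extreme, round(tail_probability, 4))  # 91390 3810 0.0417
```

This is the lower tail of a hypergeometric distribution, which is the same calculation as a one-sided Fisher exact test on the 2x2 table.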

## comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2008-01-15T03:57:22.000Z · LW(p) · GW(p)

Cyan, excellent link!

## comment by Benquo · 2008-01-15T04:14:39.000Z · LW(p) · GW(p)

Caledonian,

I agree that the two cases are not precisely the same. I also agree that they are not, as a matter of degree, very close. But it seems to me that stopping at a desired result is implicitly the same as "throwing out" other possible results, if the desired result is one of the several results possible in the range of all feasible "n"s. In other words, what I meant by my "more concrete" example is that researcher 2's experiment is properly a member of the set of all possible type-2 experiments (all of which will produce 60%+), while researcher 1's experiment is one of the set of all possible type-1 experiments which may well produce a different number.

The method of selecting "n" matters. Look at Bertrand's paradox for a mathematical example: http://en.wikipedia.org/wiki/Bertrand%27s_paradox_%28probability%29 P(60% cured|n=100) is not the same thing as P(n=100|60% cured).

On the other hand, it might be that Mr. Yudkowsky meant us to read r=70 as implying that n=100 was the *minimum* n. In that case, provided we know researcher 2 didn't throw out another n=100 data set that he didn't like, I can see why the examples are equivalent; only once researcher 2 expands the data set are we to be suspicious. A good heuristic, then, might be in such cases to only look at the first n = "minimum number of data points" cases. This could save otherwise useless data.

I had assumed from the description of researcher 2's motives (as I think others did) that he performed the minimum number of tests to get the desired result, and that r=70/100 was offset by the control group.

## comment by Caledonian2 · 2008-01-15T04:52:44.000Z · LW(p) · GW(p)

But it seems to me that stopping at a desired result is implicitly the same as "throwing out" other possible results

You did not speak about throwing out *possible* results. You spoke of throwing out data that went against the desired conclusion.

These are very, very different actions, with different implications.

## comment by Paul_Gowder · 2008-01-15T07:00:40.000Z · LW(p) · GW(p)

Cyan, that source is slightly more convincing.

Although I'm a little concerned that it, too, is attacking another strawman. At the beginning of chapter 37, it seems that the author just doesn't understand what good researchers do. In the medical example given at the start of the chapter (458-462ish), many good researchers would use a one-sided hypothesis rather than a two-sided hypothesis (I would), which would better catch the weak relationship. One can also avoid false negatives by measuring the power of one's test. MacKay also claims that "this answer does not say how much more effective A is than B." But that's just false: one can get an idea of the size of the effect with either sharper techniques (like doing a linear regression, getting beta values and calculating r-squared) or just by modifying one's null hypothesis (i.e. demanding that a datum improve on control by X amount before it counts in favor of the alternative hypothesis).

Given all that, I'm going to withhold judgment. MacKay's argument on the coin flip example is convincing on the surface. But given his history from the prior pages of understating the counterarguments, I'm not going to give it credence until I find a better statistician than I to give me the response, if any, from a "sampling theory" perspective.

## comment by steven · 2008-01-15T13:23:05.000Z · LW(p) · GW(p)

It's worth noting that hypothesis testing as it's normally taught is a messy, confused hybrid of two approaches (Fisher and Neyman/Pearson), each of which is individually somewhat more elegant (but still doesn't make philosophical sense):

http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf

http://marketing.wharton.upenn.edu/ideas/pdf/Armstrong/StatisticalSignificance.pdf

## comment by Cyan2 · 2008-01-15T16:47:11.000Z · LW(p) · GW(p)

Paul Gowder,

I agree with you that MacKay's Chi-squared example fails to criticize frequentist best practice. That said, all of the improvements you suggest seem to me to highlight the problem -- you have lots of tools in the toolbox, but only training and subjective experience can tell you which ones are most appropriate. On the question of "which approach is more subjective?", the frequentist advantage is illusory. (On the question of "which approach has the best philosophical grounding?" I go with the Cox theorems.)

## comment by Paul_Gowder · 2008-01-16T18:32:16.000Z · LW(p) · GW(p)

Cyan, I've been mulling this over for the last 23 hours or so -- and I think you've convinced me that the frequentist approach has worrisome elements of subjectivity too. Huh. Which doesn't mean I'm comfortable with the whole priors business either. I'll think about this some more. Thanks.

## comment by Svante · 2008-01-21T22:15:47.000Z · LW(p) · GW(p)

As a full-blown Bayesian, I feel that the bayesian approach is *almost* perfect. It was a revelation when I first realized that instead of having this big frequentist toolbox of heuristics, one can simply assume that every involved entity is a random variable. Then everything is solved! But then pretty quickly I came to the catch, namely that to be able to do anything, the probability distributions must be parameterized. And then you start to wonder what the pdf's of the parameters should be, and off we go into infinite regress.

But the biggest catch is of course that the integral for the posterior is almost never solvable. If that wasn't the case, I believe we would have had superhuman AI a long time ago. Still, I think bayesian methods are underexploited in AI. For example, it is straight-forward to make a "curious" system that asks the user all the things it is uncertain of, in a way that minimizes the need for human input (My lab is currently working on such a system for auditory testing).

Replies from: moshez

## ↑ comment by moshez · 2012-11-21T17:47:13.717Z · LW(p) · GW(p)

You don't need to solve the integral for the posterior analytically, you can usually Monte-Carlo your way into an approximation. That technique is powerful enough on reasonably-sized computers that I find myself doubting that this is the only hurdle to superhuman AI.
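As one illustration of the idea, here is a minimal sketch that approximates a posterior mean by plain Monte Carlo instead of solving the integral. It assumes a uniform prior on the cure rate and the 70-cures-in-100 data from the post; the function name `posterior_mean_mc` is made up for this example, and self-normalized importance sampling from the prior is only one simple choice among many Monte Carlo methods.

```python
import random

def posterior_mean_mc(cures, patients, samples=200_000, seed=0):
    """Estimate E[theta | data] by weighting prior draws by their likelihood.

    Assumes a uniform prior on theta (an assumption of this sketch).
    """
    rng = random.Random(seed)
    weight_sum = 0.0
    weighted_theta = 0.0
    for _ in range(samples):
        theta = rng.random()  # draw from the uniform prior on [0, 1)
        # Likelihood of the observed data for this theta (constant factors cancel).
        weight = theta ** cures * (1.0 - theta) ** (patients - cures)
        weight_sum += weight
        weighted_theta += weight * theta
    return weighted_theta / weight_sum

estimate = posterior_mean_mc(70, 100)
exact = 71 / 102  # mean of the Beta(71, 31) posterior, available analytically here
print(round(estimate, 3), round(exact, 3))
```

In this toy case the exact answer is known, which makes it a convenient check; the point of the technique is that the same loop still works when no closed form exists.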

## comment by ChrisHibbert · 2009-07-27T06:38:10.796Z · LW(p) · GW(p)

"Two medical researchers use the same treatment independently [...] one had decided beforehand [...] he would stop after treating N=100 patients, [...]. The other [...] decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, [...]. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?"

[...]

If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher's private intentions. [...]

The expectations and the stopping rule make a difference. The reason the Monty Hall Puzzle turns out the way it does is that part of the setup is that Monty Hall **always** opens a different door than you chose. When I tell the story without mentioning that fact, you **should** get a different answer.

## comment by ksvanhorn · 2011-01-13T19:39:40.165Z · LW(p) · GW(p)

"If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory [...] then the whole edifice goes up in smoke. Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory."

If you think of probability theory as a form of logic, as Jaynes advocates, then the laws and theorems of probability theory are the proof theory for this logic, and measure theory is the logic's model theory, with measure-theoretic probability spaces (which can be defined entirely with ZF, as you suggest) being the models.

## comment by LauWren · 2011-02-16T14:48:41.660Z · LW(p) · GW(p)

We have additional information about researcher 2's experiment. If researcher 2 didn't look at the data before that point, then the procedures were the same, so the data should be treated the same.

If researcher 2 did check the data along the way--a reasonable enough assumption, given researcher 2's goal--then there were other tests which *all* came out below 60%. There was an upswing in successes at the end, and we know it. The other experiment may well have experienced the same thing, but in experiment 2, I don't have to look; I see it. Was there an important variable that we overlooked?

Of course, this is only due to extra information I happen to have about the latter. I haven't bothered to check whether something similar was there for the previous, because there wasn't anything to make it catch my attention. If I have the tools to do so, I would still *like* to treat them both the same--I want to see researcher 1's results lined up by time as well. If the upswing is repeated...well, that's funny.

## ↑ comment by loqi · 2011-10-12T05:24:05.174Z · LW(p) · GW(p)

Good point, there is some ordering information leaked. This is consistent with identical likelihoods for both setups - learning *which permutation* of arguments we're feeding into a commutative operator (multiplication of likelihood ratios) doesn't tell us anything about its result.

## comment by Polymeron · 2011-04-13T13:39:03.768Z · LW(p) · GW(p)

Eliezer, I accept your point about the underlying laws of probability. However, your example is extremely flawed.

Of course what the researcher operates by should affect our interpretation of the evidence; it is, in itself, another piece of evidence! Specifically in this case, publishing your research only when you reach a certain conclusion implies that any similar researches that did not reach this threshold did not get published, and are thus not available to our evidence pool. This is filtered evidence.

So without knowing how many similar researches were conducted, the conclusion from the one research that did get published can't be seen as very strong. Do I need to draw the Bayesian analysis that shows why?

## comment by Khaled · 2011-05-15T01:42:10.536Z · LW(p) · GW(p)

Should we then draw different conclusions from their experiments?

I assume you mean if you only saw *one* of them (knowing the researcher's intentions in either case)? In that case, I would say yes. For the first, the N is random, while for the second N is the smallest N where r>=60.
In the second case, the question is: what is the probability that the cure rate will *ever* reach 60%? The first case answers the question: what is the cure rate probability according to a sample of N=100?

Yes, I would say, draw very different conclusions, since you are answering very different questions!

## comment by dlthomas · 2011-10-11T18:14:01.549Z · LW(p) · GW(p)

If there is a difference, it is not because the experiments went differently, it is because the experiments could have gone differently, and so the likelihoods of them happening the way they did happen is different.

The Monty Hall problem was mentioned above. I pick a door, Monty opens a door to reveal a goat, I can stick or switch (but can't take the goat). Whether Monty is picking a random door or picking a door he knows doesn't have the car, the evidence is the same - Monty opened a door and revealed a goat. But what matters is what might have happened otherwise. If Monty always picks a door with a goat, then I win if I switch 2/3 of the time. If Monty might have picked the door with the car (and just happened not to), I win if I switch only 50% of the time.

Same evidence, different conclusions based solely on what someone might have done otherwise not based on what actually happened; and I am confident of the difference in the Monty Hall problem, as I have not only read about it but also simulated it.
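A simulation along these lines might look like the following sketch. The function name `switch_win_rate` and the trial count are arbitrary choices for illustration; in the random-host case, rounds where the car is revealed are discarded, since they do not match the evidence described.

```python
import random

def switch_win_rate(host_knows, trials=200_000, seed=0):
    """Fraction of goat-revealing rounds in which switching wins the car."""
    rng = random.Random(seed)
    wins = 0
    counted = 0
    for _ in range(trials):
        car = rng.randrange(3)
        pick = rng.randrange(3)
        others = [door for door in range(3) if door != pick]
        if host_knows:
            # Monty deliberately opens a goat door among the two he may open.
            opened = rng.choice([door for door in others if door != car])
        else:
            # Monty opens one of the other two doors at random.
            opened = rng.choice(others)
            if opened == car:
                continue  # car revealed: not the evidence we actually saw
        counted += 1
        switched_to = next(door for door in others if door != opened)
        wins += (switched_to == car)
    return wins / counted

print(round(switch_win_rate(host_knows=True), 3))   # close to 2/3
print(round(switch_win_rate(host_knows=False), 3))  # close to 1/2
```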

In the situation given, Researcher 2 did stop at 100 experiments, but might have stopped at 49, or 280. Researcher 1 was sure to stop at 100. I am not unwilling to accept that this *doesn't* change the meaning of the evidence, in this case, but I do not understand at all why it should be "obvious" that it can't, given that it *does* in the case of the Monty Hall problem.

## ↑ comment by thomblake · 2011-10-11T20:13:36.607Z · LW(p) · GW(p)

Did you read the chapter linked at the end of the post?

A hopefully intuitive explanation: A spy watching the experiments and using Bayesian methods to make his own conclusions about the results, will not see any different evidence in each case and so will end up with the same probability estimate regardless of which experimenter he watched.

While the second experimenter might be contributing to publication bias by using that *method* in general, he nonetheless should not have come up with a different *result*.

It seems worth noting the tension between this and bottom-line reasoning. Could the second experimenter have come up with the desired result no matter what, given infinite time? And if so, is there any further entanglement between his hypothesis and reality?

Replies from: dlthomas

## ↑ comment by dlthomas · 2011-10-11T20:29:16.152Z · LW(p) · GW(p)

Why would a spy watching Monty Hall be different?

Replies from: thomblake

## ↑ comment by thomblake · 2011-10-11T20:45:21.547Z · LW(p) · GW(p)

Amongst other reasons, Monty isn't the experimenter. I'm really not sure in precisely what way Monty Hall is analogous to these experiments.

Replies from: dlthomas

## ↑ comment by dlthomas · 2011-10-11T21:10:03.811Z · LW(p) · GW(p)

Monty Hall is analogous in that we are looking at evidence and trying to make conclusions about likelihoods. It is relevant because the likelihoods are different depending on what was in Monty's head in the past, after observing the same physical evidence. Monty is not the experimenter; where does that make a difference? Could one reformulate it so that he was? He would be running two different experiments, surely - but then why isn't that the case for the two researchers?

## ↑ comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2011-10-11T21:15:43.906Z · LW(p) · GW(p)

The difference is that depending on Monty's algorithm, there is a different probability of getting the *exact* result we saw, namely seeing a goat. The *exact* event we *actually* saw happens with *different probability depending on Monty's rule*, so Monty's rule changes the meaning of that result.

The researchers don't get a given exact sequence of 100 results with different probability depending on their state of mind - their state of mind is not part of the state of the world that the result sequence tells us about, the way Monty's state of mind is part of the world that generates the exact goat.

To look at it another way, a spy watching Monty open doors and get goats would determine that Monty was deliberately avoiding the prize. Watching a researcher stop at 100 results doesn't tell you anything about whether the researcher planned to stop at 100 or after getting a certain number of successes. So, just like that result doesn't tell you anything about the researcher's state of mind, knowing about the researcher's state of mind doesn't tell you anything about the result.

Replies from: dlthomas, thomblake

## ↑ comment by thomblake · 2011-10-13T00:31:41.256Z · LW(p) · GW(p)

Suppose that the frequency of cures actually converges at 60%, and each researcher performs his experiment 100 times. Researcher A should end up with about 6000 cures out of 10000, and Researcher B should end up with .7n cures out of n. We would expect in advance, after being told that Researcher B ended up testing 10000 people, that he encountered 7000 cures.

It seems that once we know that Researcher B was not going to stop until the incidence of cures was 70%, we do not learn anything further about the efficacy of the treatment by looking at his data.

What's wrong about this?

Replies from: pengvado

## ↑ comment by pengvado · 2011-10-14T17:50:37.812Z · LW(p) · GW(p)

You are proposing to partition the data into two observations, namely "number of trials that were performed" and "the results of those trials".

The information you have after observing the first part and not the second does depend on the researcher's stopping criterion. And the amount of information you gain from the second observation likewise depends on it (since some of it was already known from the first observation in one case and not in the other). But your state of information after both observations does not depend on the stopping criterion.

Also, the most likely outcome of your proposed experiment is not .7n cures for some n. Rather, it's that Researcher B never stops.

Replies from: thomblake

## comment by lmm · 2011-10-12T11:54:33.627Z · LW(p) · GW(p)

I was confused by this post for some time, and I feel I have an analogous but clearer example: Suppose scientist A says "I believe in proposition A, and will test it at the 95% confidence level", and scientist B says "I believe in proposition B, and will test it at the 99% confidence level". They go away and do their tests, and each comes back from their experiment with a p-value of 0.03. Do we now believe proposition A more or less than proposition B? The traditional scientific method, with its emphasis on testability, prefers A to B; for a bayesian it's clear that we have the same amount of evidence for each.

Have I fairly characterised both sides? Does this capture the same paradox as the original example, and is it any clearer?

## comment by dlthomas · 2011-10-26T15:59:01.130Z · LW(p) · GW(p)

Tentatively, I think we must treat the two differently, in some respect somewhere, or we are vulnerable to manipulation. Where does the flaw lie in the following?

If the second researcher had instead said, "I am going to run 1000 experiments with 100 people each, and publish only those whose cure rate exceeds 60%", there is a huge selection bias in the data we see and our update should be tiny if we can't get access to the discarded data.

If the researcher decided instead "I am going to run 1000 experiments in parallel, adding one person at a time to each, and publish only those whose cure rate exceeds 60%," then we would still see the same sort of bias in the data - after he has added 100 people to each experiment we are in the same situation as above.

If 1000 researchers, ideologically aligned in their commitment to this result, run experiments where they post their results when the cure rate exceeds 60%, this is equivalent to the second scenario.

Therefore, if enough people behaved like researcher 2, we would see biases in published data (which may conceivably balance out, with enough difference in ideology, but this does not seem to be guaranteed and I am not convinced it is even likely).

Therefore, researcher 2 should be considered to be acting badly, and we should reject (or at least weaken) experiments done in that fashion.

The above assumes that we do *not* see data from in-progress experiments. If we do, then I believe we can safely consider each data point with the same weight.

## ↑ comment by thomblake · 2011-10-26T17:16:15.514Z · LW(p) · GW(p)

That he discarded data is additional information to update on, that was not present in the above example. It doesn't matter what the researcher *intended*, if he got that data on the first try.

## ↑ comment by dlthomas · 2011-10-26T17:27:06.902Z · LW(p) · GW(p)

My point is that if we allow researchers to follow methodology like this, they will in practice be discarding data, and we will be unaware of it - even when each individual researcher is otherwise completely honest in how they deal with the data.

## comment by gRR · 2012-02-12T03:10:06.404Z · LW(p) · GW(p)

The argument that confused me at first was: "Wouldn't the second researcher always be able to produce a >60% result given enough time and resources, no matter what the actual efficacy of the treatment is?"

But this is not true. If the true efficacy is < 60%, then the probability of observing a ">60%" result at least once in a sequence of N experiments does not tend to 1 as N goes to infinity.
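This can be checked numerically under one literal reading of the stopping rule. The sketch below assumes the researcher checks after every patient and stops as soon as the running cure rate strictly exceeds 60%, and that the true efficacy is 50%; both are assumptions made for illustration, and `prob_ever_exceeds` is a hypothetical helper name. It computes the exact probability of having stopped by a given number of patients.

```python
def prob_ever_exceeds(p, max_patients, num=3, den=5):
    """Probability that the running cure rate exceeds num/den at some n <= max_patients.

    p is the true per-patient cure probability; the threshold num/den
    defaults to 3/5 = 60% and is compared with integer arithmetic.
    """
    alive = {0: 1.0}  # cures so far -> probability mass of paths that have not stopped
    stopped = 0.0
    for n in range(1, max_patients + 1):
        next_alive = {}
        for cures, mass in alive.items():
            for new_cures, new_mass in ((cures + 1, mass * p), (cures, mass * (1 - p))):
                if den * new_cures > num * n:
                    stopped += new_mass  # rate just exceeded the threshold: stop
                else:
                    next_alive[new_cures] = next_alive.get(new_cures, 0.0) + new_mass
        alive = next_alive
    return stopped

for horizon in (10, 100, 1000):
    print(horizon, round(prob_ever_exceeds(0.5, horizon), 4))
```

Under these assumptions the stopping probability is at least 1/2 (a cure on the very first patient already gives a 100% rate) and increases with the horizon, but it stays bounded away from 1 however long the researcher is willing to continue.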

## comment by PhilosophyTutor · 2012-03-16T04:46:40.263Z · LW(p) · GW(p)

In the real world, Eliezer's example simply doesn't work.

In the real world you only hear about the results when they are published. The prior probability of the biased researcher publishing a positive result is higher than the prior probability of the unbiased researcher publishing a positive result.

The example only works if you are an omniscient spy who spies on absolutely all treatments. It's true that an omniscient spy should just collate all the data regardless of the motivations of the researcher spied upon. However unless you are an omniscient spy you do need to take into account how an agent showing you new data went about gathering the data that they are showing you. Otherwise you are vulnerable to being "mushroomed" and updating yourself into beliefs that will cause you to lose.

## comment by jsalvata · 2012-04-18T01:40:32.178Z · LW(p) · GW(p)

Eliezer:

The results of the two experimenters in the example *are* different: to begin with, the 2nd experimenter's first result is a non-cure (otherwise he would have stopped there with a 100% success); one of the three following results is also a non-cure (otherwise he would have stopped with a 75%); etc. Also, his last result is a cure (otherwise he would have stopped one patient earlier).

The first experimenter certainly got different results -- or you may as well win the lottery: the odds that a Bernoulli trial produces a sequence x1..x100 in which no prefix x1...xN has a higher rate of successes than the whole sequence are really small.

Note that this argument gets weaker as you change the definition of "definitely greater than 60%" to require greater statistical confidence (indeed .99 results are less sensitive to methodological biases than .95 results), but even at .99 the odds that the sequence obtained by the 1st doctor would finish exactly where the 2nd doctor would stop are well below 1/10th (I just made a quick upper bound calculation, it is actually even smaller).

The problem is that (1) when the results are reported in a journal, you only get the total counts -- which hides the methodological trap, and (2) even if you got the full results, you most likely don't have the computational power to discover the difference (except of course in the ~60% of reports from doctor 2 where he reports on a single patient).

## comment by mattwise · 2012-08-05T05:20:37.552Z · LW(p) · GW(p)

I don't feel sufficiently comfortable with statistics to tear apart the given example. I do have a different example with which to refute the point that the evidential impact of a fixed set of data should be independent of the researchers' prior private thoughts.

Suppose I have two researchers, both looking at the correlations between acne and colored jelly beans. Alfred does twenty tests each with X subjects. Each test will feed subjects jelly beans of a single color for a week and then look at incidences of acne. Boris theorizes that green jelly beans are correlated with acne. Boris then does twenty tests each with X subjects, identical to Alfred's tests.

Alfred and Boris each use the exact same experimental procedure and each get the exact same results, finding higher rates of acne in subjects fed green jelly beans than in subjects fed other colored jelly beans. Boris' experiment is stronger evidence for a link between green jelly beans and acne than Alfred's experiment. Why? Because coincidences happen all the time.

Boris was looking for a correlation between green jelly beans and acne and the odds that Boris would find a correlation between green jelly beans and acne (by chance alone) was very low. Alfred was looking for a correlation but he wasn't specific about what correlation he was looking for. By chance alone, he was just as likely to find a correlation between blue jelly beans, or red jelly beans, or any of the 17 other colors in the experiment. The fact that this experiment happened to show higher rates with green jelly beans isn't worth very much evidence. Now if Alfred were to use this experiment to form a hypothesis that green jelly beans were correlated with acne and perform another experiment which ALSO showed a relationship between green jelly beans and acne, THEN he would have much stronger evidence.

Jelly beans inspired by XKCD: http://xkcd.com/882/
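The arithmetic behind "coincidences happen all the time" can be sketched as follows. The comment does not state a significance threshold, so the 5% per-test level and the independence of the twenty tests are assumptions made here for illustration.

```python
# Assumed per-test false-positive rate (not stated in the comment).
alpha = 0.05
colors = 20

# Chance that one pre-specified color (say, green) shows a spurious correlation.
p_prespecified = alpha

# Chance that at least one of the twenty colors does, assuming independent tests.
p_any_color = 1 - (1 - alpha) ** colors

print(p_prespecified, round(p_any_color, 3))  # 0.05 0.642
```

Under these assumptions, a spurious hit on *some* color is more likely than not, while a spurious hit on the one color named in advance remains a 1-in-20 event.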

Having written this out, I'll offer a simpler example. Charlie and David each have a six sided die. Each thinks his die is weighted. Charlie says "I think this die will land '6' when I roll it". He rolls the die and it lands '6'. David just rolls the die and it lands '6'. Each now believes that his die will always land '6'. Exact same evidence. Charlie is more justified in his belief that the die is weighted to land '6' because if his die was not weighted, there would only be a 1/6 chance that he would have rolled a '6'. If David's die was not weighted, there would be a 6/6 chance that his die would roll SOMETHING and generate his belief.

When evaluating the rational evidential impact of the results of an experiment, it is imperative that you take into account what it is that is being tested, and that is something that only exists in the experimenter's private thoughts (or notebook or whatnot).

If you have twenty minutes, I would recommend this portion of a Radiolab broadcast about coincidences: http://www.radiolab.org/2009/jun/15/a-very-lucky-wind/

## comment by Sengachi · 2012-09-10T04:04:27.844Z · LW(p) · GW(p)

I think you're misunderstanding probability theory a little. Probability theory is the chance that x result means that the universe operates on principle y. The fact that the second researcher had no reason to stop until he tested 100 people means that the % of patients cured did not exceed 60% until his last batch of test subjects. Which significantly alters the chance that the universe operates on principle y. The first researcher could have had a % over 60 at any time during the experiment. Which is a physical difference. The probability used is not based on the researcher's private thoughts, it is based on their experimental procedure, which is different, regardless of the fact they ended with the same results.

Replies from: wedrifid, RobinZ, Kindly

## ↑ comment by wedrifid · 2012-09-10T05:03:40.405Z · LW(p) · GW(p)

Probability theory is the chance that x result means that the universe operates on principle y.

Is there a typographical error in here? This seems like an odd claim as currently written. For instance was the word 'theory' accidentally included?

## ↑ comment by RobinZ · 2012-09-10T05:05:04.404Z · LW(p) · GW(p)

First off: welcome to Less Wrong! I hope you're enjoying the articles you've read so far.

That said: I don't think you really address the thrust of this article, here. The point is that all we have learned about *the principle the universe operates on* from either experiment is that said principle produced a given result - the *exact same result*, by stipulation. Any difference in the conclusions drawn from that result *must*, therefore, arise from difference in *assumptions about the universe* - because the two experimenters are working from literally *the same data*.

## ↑ comment by Kindly · 2012-09-10T12:31:37.806Z · LW(p) · GW(p)

Your skepticism is a good sign, but your mathematical intuition is lacking.

Suppose for concreteness that the second researcher did two batches of 50 experiments, and that we only consider two hypotheses: H1 (the treatment is 80% effective) and H2 (the treatment is 50% effective). Then you are quite correct in saying that Pr[n=100, r=70|H1] and Pr[n=100, r=70|H2] are different for the two scientists -- the second scientist needs, in addition, that the first batch of 50 had fewer than 30 successes. Algebraically, then, there seems to be no reason that the *odds* Pr[n=100, r=70|H1]/Pr[n=100, r=70|H2] would miraculously be the same for both scientists, right?

It turns out that the odds (which are all we care about) are, in fact, the same. This is what we'd expect based on the general principle "the same data has only one interpretation".

But in case you're still worried, here's a way to help your intuitions get to the right thing. Fix some arbitrary sequence of patient outcomes with n=100 and r=70 which the second researcher could conceivably get. It's clear that the probability of following this sequence is the same for both researchers: this screens off the experimental procedure. So any particular sequence of patient outcomes gives the same odds, which means overall the odds are the same.

(There's also the sequences of outcomes that *only* the first researcher can get. But we don't worry about those because all n=100, r=70 outcomes are equally likely for the first researcher.)
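The argument above can be checked with exact arithmetic. The following is a minimal sketch, not Kindly's own calculation: it assumes the second researcher runs the second batch of 50 only if the first batch had at most 30 cures (the `max_first` parameter is a hypothetical way of pinning down the stopping rule), and it uses exact fractions so the two odds can be compared for equality.

```python
from fractions import Fraction
from math import comb

def seq_prob(p, cures=70, patients=100):
    """Probability of one specific ordering with the given number of cures."""
    return p**cures * (1 - p)**(patients - cures)

def count_fixed_n(cures=70, patients=100):
    """Researcher 1: every ordering of 70 cures among 100 patients is reachable."""
    return comb(patients, cures)

def count_two_batches(cures=70, batch=50, max_first=30):
    """Researcher 2 (assumed rule): the second batch is run only if the first
    batch had at most `max_first` cures, so only those orderings are reachable."""
    return sum(comb(batch, k) * comb(batch, cures - k)
               for k in range(max_first + 1)
               if 0 <= cures - k <= batch)

h1, h2 = Fraction(4, 5), Fraction(1, 2)   # H1: 80% effective, H2: 50% effective
n1, n2 = count_fixed_n(), count_two_batches()

pr1 = {h: n1 * seq_prob(h) for h in (h1, h2)}   # Pr[n=100, r=70 | H], researcher 1
pr2 = {h: n2 * seq_prob(h) for h in (h1, h2)}   # Pr[n=100, r=70 | H], researcher 2

odds1 = pr1[h1] / pr1[h2]
odds2 = pr2[h1] / pr2[h2]
print(float(pr1[h1]), float(pr2[h1]))   # the probabilities differ...
print(float(odds1), float(odds2))       # ...but the odds do not
```

The count of reachable orderings multiplies both hypotheses' probabilities by the same factor, so it cancels in the odds.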

## comment by mantis · 2012-09-10T18:19:53.681Z · LW(p) · GW(p)

*Old School statisticians thought in terms of tools, tricks to throw at particular problems.*

This reminds me of a joke posted on a bulletin board in the stats department at UC Riverside. It was part of a list of humorous definitions of statistical terms. For "confidence interval," it said that the phrase uses a particular, euphemistic meaning of the word "interval;" that meaning could be used to construct similar phrases such as "hat interval," "card interval," or "interval or treat."

## comment by alex_zag_al · 2012-12-10T15:19:28.058Z · LW(p) · GW(p)

And yet... should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy - so shouldn't you need messy reasoning to handle it?

And in a way, you do, even doing Bayesian statistics. The messiness is just in the actual numerical calculations, not in the definitions of the rules.

Suppose you're trying to find a good model for some part of the real world, and you've got your set of models you're considering. When you see data, and you use Bayes' Theorem to find the posterior probabilities, the expression is going to have in it P(data|m) for each model m, and P(data). And if your models are messy - necessary to represent a messy real world - then these are going to be complicated expressions that make for a really awful calculation.

So Bayesian statisticians have a toolbox for approximating posterior probabilities - Laplace approximation, Gibbs sampling, etc - to deal with *that* messiness.

(these are my impressions from a bit of reading of the Bayesian statistics literature - I've never actually *done* these things, I don't know what the expressions really *look* like)

## comment by nostalgebraist · 2012-12-24T10:20:20.911Z · LW(p) · GW(p)

*"Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains)."*

I've never understood why I should be concerned about dynamic Dutch books (which are the justification for conditionalization, i.e., the Bayesian update). I can understand how static Dutch books are relevant to finding out the truth: I don't want my description of the truth to be inconsistent. But a dynamic Dutch book (in the gambling context) is a way that someone can exploit the combination of my belief at time (t) and my belief at time (t+1) to get something out of me, which doesn't seem like it should carry over to the context of trying to find out the truth. When I want to find the truth, I simply want to have the best possible belief in the present -- at time (t+1) -- so why should "money" I've "lost" at time (t) be relevant?

Perhaps I simply want to avoid getting screwed in life by falling into the equivalents of Dutch books in real, non-gambling-related situations. But if that's the argument, it should depend on how frequently such situations actually crop up -- the mere existence of a Dutch book shouldn't matter if life is never going to make me take it. Why should my entire notion of rationality be based on avoiding one particular -- perhaps rare -- type of misfortune? On the other hand, if the argument is that falling for dynamic Dutch books constitutes "irrationality" in some direct intuitive sense (the same way that falling for static Dutch books does), then I'm not getting it.

## comment by nostalgebraist · 2012-12-27T17:51:10.895Z · LW(p) · GW(p)

Incidentally, Eliezer, I don't think you're right about the example at the beginning of the post. The two frequentist tests are asking distinct questions of the data, and there is not necessarily any inconsistency when we ask two different questions of the same data and get two different answers.

Suppose A and B are tossing coins. A and B both get the same string of results -- a whole bunch of heads (let's say 9999) followed by a single tail. But A got this by just deciding to flip a coin 10000 times, while B got it by flipping a coin until the first tail came up. Now suppose they each ask the question "what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?"

In A's case the answer is of course very small; most strings of 10000 flips have many more than one tail. In B's case the answer is of course 1; B's method ensures that exactly one tail is seen, no matter what happens. The data was the same, but the questions were different, because of the "when doing what I did" clause (since A and B did different things). Frequentist tests are often like this -- they involve some sort of reasoning about hypothetical repetitions of the procedure, and if the procedure differs, the question differs.

If we wanted to restate this in Bayesian terms, we'd have to do so by taking into account that the interpreter knows what the method is, not just what the data is, and the distributions used by a Bayesian interpreter should take this into account. For instance, one would be a pretty dumb Bayesian if one's prior for B's method didn't say you'd get one tail with probability one. The observation that's causing us to update isn't "string of data," it's "string of data produced by a given physical process," where the process is different in the two cases.

(I apologize if this has all been mentioned before -- I didn't carefully read all the comments above.)

Replies from: JGWeissman

## ↑ comment by JGWeissman · 2012-12-27T18:28:46.682Z · LW(p) · GW(p)

*Now suppose they each ask the question "what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?"*

That is throwing away data. The evidence that they each observed is the sequence of coin flip results, and the number of tails in that sequence is a partial summary of the data. The reason they get different answers is that the summary throws away more data for B than for A. As you say, B already expected to get exactly one tail, so that summary tells him nothing new and he has no information to update on, while A can recover from this summary the number of heads and only loses information about the order (which cancels out anyway in the likelihood ratios between theories of independent coin flips). But if you calculate the probability that they each see that sequence you get the same answer for both, p(heads)^9999 * (1 - p(heads)).

That is, the data gathering procedure is needed to interpret a partial summary of the data, but not the complete data.
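A small sketch of the distinction, under stated assumptions: the two hypotheses compared here (a fair coin and a coin with p(heads) = 0.9999) are illustrative choices, not from the thread. The function names are hypothetical. Log-probabilities are used because the raw probabilities underflow in floating point.

```python
from math import log

HEADS, TAILS = 9999, 1

def log_lik_sequence(p_heads):
    """Log-probability of the exact observed sequence: 9999 heads, then one tail.
    A's and B's procedures both assign the sequence this probability."""
    return HEADS * log(p_heads) + TAILS * log(1 - p_heads)

def tail_area_a(p_heads, flips=HEADS + TAILS):
    """A's question: P(at most 1 tail in a fixed number of flips)."""
    return p_heads**flips + flips * p_heads**(flips - 1) * (1 - p_heads)

def tail_area_b(p_heads):
    """B's question: flipping until the first tail always yields exactly one
    tail (for any p_heads < 1), so the answer does not depend on the coin."""
    return 1.0

fair, biased = 0.5, 0.9999   # illustrative hypotheses (assumptions)

# Full data: the same log likelihood ratio for A and for B.
log_ratio = log_lik_sequence(biased) - log_lik_sequence(fair)
print("log likelihood ratio, biased vs fair:", log_ratio)

# Partial summary: the answers depend on the procedure.
print("A, fair coin:", tail_area_a(fair))     # underflows to 0.0 in floats
print("A, biased coin:", tail_area_a(biased))
print("B, either coin:", tail_area_b(fair))
```

The likelihood of the full sequence needs no knowledge of who flipped the coin. Only the summary statistic does.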

Replies from: nostalgebraist

## ↑ comment by nostalgebraist · 2012-12-27T22:43:08.175Z · LW(p) · GW(p)

Sure, the likelihoods are the same in both cases, since A and B's probability distributions assign the same probability to any sequence that is in both of their supports. But the distributions are still different, and various functionals of them are still different -- e.g., the number of tails, the moments (if we convert heads and tails to numbers), etc.

If you're a Bayesian, you think any hypothesis worth considering can predict a whole probability distribution, so there's no reason to worry about these functionals when you can just look at the probability of your whole data set given the hypothesis. If (as in actual scientific practice, at present) you often predict functionals but not the whole distribution, then the difference in the functionals matters. (I admit that the coin example is too basic here, because in any theory about a real coin, we really would have a whole distribution.)

My point is just that there are differences between the two cases. Bayesians don't think these differences could possibly matter to the sort of hypotheses they are interested in testing, but that doesn't mean that in principle there can be no reason to differentiate between the two.

## comment by Anders_H · 2014-04-27T21:06:28.592Z · LW(p) · GW(p)

I believe the example in this post is fundamentally flawed. Some of the other commenters have hinted at the reasons, but I want to add my own thoughts on this.

Before we go into the difference between the frequentist and the Bayesian approach to the problem, we first have to be clear about whether the investigators acknowledge publicly that they use different stopping rules. I am going to cover both cases.

If the stopping rule is not publicly acknowledged, the frequentist data analyst cannot take it into account. He will therefore have to use the same tests on the two datasets. Therefore, if experiment one rejects the null hypothesis, so will experiment two.

If the stopping rule is known to the public, a frequentist statistician will appropriately take this into account in his data analysis. As Eliezer says, experiment one may be statistically significant and experiment two may be non-significant. And this is completely appropriate; any rational Bayesian would do essentially the same thing:

Let "A" be the event that somebody publishes a study showing that the drug works in 60% of people. Let "B" be the hypothesis that the drug works in at least 60% of people, and let "C" be the biased stopping rule.

In the case of the first investigator, the appropriate likelihood ratio is Pr(A | B) / Pr(A| not B)

In the case of the second investigator, the appropriate likelihood ratio is Pr(A| B, C) / Pr(A| C, not B)

Pr(A | B) / Pr(A| not B) is strictly larger than Pr(A| B, C) / Pr(A| C, not B)

Any Bayesian agent who uses Pr(A | B) / Pr(A| not B) in the case of the second investigator, is throwing away important evidence, namely that a biased stopping rule was applied.

## comment by Jacob Falkovich (Jacobian) · 2015-07-02T17:23:15.469Z · LW(p) · GW(p)

You know what really helps me accept a counterintuitive conclusion? Doing the math. I spent an hour reading and rereading this post and the arguments without being fully convinced of Eliezer's position, and then I spent 15 minutes doing the math (R code attached at the end). And once the math came out in favor of Eliezer, the conclusion suddenly doesn't seem so counterintuitive :)

Here we go. I'm dividing all the numbers by five to make the code work, but it's pretty convincing either way.

- The setup - Researcher A does 20 trials always, researcher B keeps doing trials until the ratio of cures is at least 70% (1 cure / 1 trial is also acceptable).
- E - The full evidence, namely that 20 patients were tried and 14 were cured.
- H0 - The hypothesis that the success rate of the cure is 60%.
- H1 - The hypothesis that the success rate is 70%.
- Pa - Researcher A's probabilities.
- Pb - Researcher B's probabilities.

In this setup, it's clear that Pa and Pb aren't equal for everything you might want to measure. For example, for any evidence E* that doesn't contain exactly 20 observations, Pa(E*)=0. However, Reverend Bayes reminds us that the strength of our EVIDENCE depends on the odds ratio, and not on all the sub-probabilities:

P(H1|E) / P(H0|E) = P(H1)/P(H0) * P(E|H1)/P(E|H0), aka posterior odds = prior odds * odds ratio of evidence. Assuming that the prior odds are the same, let's calculate the odds ratio for both Pa and Pb and see if they are different.

Pa(E|H0) = 12.4%, as a simple binomial distribution: dbinom(14,20,0.6). Pa(E|H1) = 19.2%, from dbinom(14,20,0.7). The odds ratio: Pa(E|H1)/Pa(E|H0) = 1.54. That's the only measure of how much our posterior should change. If originally we gave each hypothesis an equal chance (1:1), we now favor H1 at a ratio of 1.54:1. In terms of probability, we changed our credence in H1 from 50% to 60.6%.

What about researcher B? I simulated researcher B a million times in each possible world, the H0 world and the H1 world. In the H0 world, evidence E occurred only 5974 times out of a million, for Pb(E|H0) = 0.597%, which is very far from 12.4%. It makes sense: researcher B usually stops after the first trial, and occasionally goes on for zillions! What about the H1 world? Pb(E|H1) = 0.919%. The odds ratio: Pb(E|H1) / Pb(E|H0) = wait for it = 1.54. The same, up to simulation noise!

I think all the other posts explain quite well why this was obviously the case, but if you like to see the numbers back up one side of an argument, you got 'em. I personally am now converted, amen.

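R code for simulating a single researcher B, followed by a million researchers B in the H1 world:

```
# Simulate one researcher B; returns the trial at which B stops
# (21 means B had not stopped by trial 20).
resb <- function(p = 0.6) {
  cures <- 0
  tries <- 0
  while (tries < 21) {  # Since we only care whether B stops after 20 trials, we don't need to simulate past 21.
    tries <- tries + 1
    cures <- cures + rbinom(1, 1, p)
    if ((cures / tries) >= 0.7) return(tries)
  }
  tries
}

# Simulate a million researchers B in the H1 world and count
# how many stop at exactly trial 20.
x <- sapply(1:1000000, function(i) { resb(0.7) })
length(x[x == 20])
```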
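The simulation can also be cross-checked without random sampling. This is a Python sketch rather than the R used above, and the function names are my own. It counts the orderings of outcomes for which researcher B stops at exactly trial 20 with 14 cures, then multiplies that count by the probability of any one such ordering.

```python
from math import comb

def count_stopping_paths(stop_at=20, cures_needed=14):
    """Count the cure/failure orderings for which researcher B stops exactly at
    trial `stop_at` with `cures_needed` cures, i.e. the cure rate first
    reaches 70% on that trial."""
    paths = {0: 1}  # cures so far -> number of orderings that have not stopped
    for trial in range(1, stop_at + 1):
        nxt = {}
        for cures, ways in paths.items():
            for outcome in (0, 1):
                c = cures + outcome
                stopped = 10 * c >= 7 * trial  # integer form of c/trial >= 0.7
                if trial == stop_at:
                    if stopped and c == cures_needed:
                        nxt[c] = nxt.get(c, 0) + ways
                elif not stopped:
                    nxt[c] = nxt.get(c, 0) + ways
        paths = nxt
    return paths.get(cures_needed, 0)

def seq_prob(p, cures=14, trials=20):
    """Probability of one specific ordering with 14 cures in 20 trials."""
    return p**cures * (1 - p)**(trials - cures)

n_a = comb(20, 14)            # researcher A: every ordering is possible
n_b = count_stopping_paths()  # researcher B: only orderings that stop at trial 20

pa_h0, pa_h1 = n_a * seq_prob(0.6), n_a * seq_prob(0.7)
pb_h0, pb_h1 = n_b * seq_prob(0.6), n_b * seq_prob(0.7)

print("A:", pa_h0, pa_h1, pa_h1 / pa_h0)
print("B:", pb_h0, pb_h1, pb_h1 / pb_h0)
```

The two researchers differ only in the count of admissible orderings, which appears in both numerator and denominator of the odds ratio and cancels.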

## comment by vasaka · 2017-04-25T05:53:47.117Z · LW(p) · GW(p)

This example needs to be refined. The two experiments do not produce the same data: the second one has a different probability space and an additional data point, the stopping position N. When computing probabilities you should also condition on that stopping point. The fact that N is screened off by the other data is nontrivial, and waving it away just on an assumption of beauty could lead to a mistake.

It turned out that in this case it is the correct move, but it could quite easily have been a mistake.

## comment by Liam Goddard · 2019-05-22T22:04:23.526Z · LW(p) · GW(p)

The two experiments would differ. In Experiment 1, we now have received evidence of a 70% probability of a cure. However, Experiment 2 doesn't offer the same evidence, because it will stop as soon as it gets significantly over 60%. Based on the randomness of results, it will not always fit the true probability. If the real probability was 70%, wouldn't it have most likely gotten up to 70% with 7 out of 10, or 14 out of 20? For most of Experiment 2, less than 60% of the patients were cured. The fact that by 100 patients it happened to go up was most likely an error in the data, and if the experiment was continued it would probably drop back below 60%.

## comment by Ian Televan · 2021-03-18T12:31:56.457Z · LW(p) · GW(p)

Fixing my predictions now, before going to investigate this issue further (I have MacKay's book within hand's reach and would also like to run some Monte Carlo simulations to check the results; going to post the resolution later):

a) It seems that we ought to treat the results differently, because the second researcher in effect admits to p-hacking his results. b) But on the other hand, what if we modify the scenario slightly: suppose we get the results from both researchers 1 patient at a time. Surely we ought to update the priors by the same amount each time? And so by the time we get the 100th individual result from each researcher, the priors should be the same, even if we then find out that they had different stopping criteria.

My prediction is that argument a) turns out to be right and argument b) contains some subtle mistake.
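Argument b) can be made concrete with a short sketch. It assumes a uniform Beta(1, 1) prior over the cure rate and a hypothetical ordering of the 100 outcomes; neither is specified in the original scenario. The point to notice is that the update rule takes only the next outcome as input, never the researcher's stopping rule.

```python
def update(alpha, beta, cured):
    """One Bayesian update of a Beta(alpha, beta) belief about the cure rate."""
    return (alpha + 1, beta) if cured else (alpha, beta + 1)

def posterior_after(outcomes, prior=(1, 1)):
    """Apply the update once per patient, in the order the results arrive."""
    alpha, beta = prior
    for cured in outcomes:
        alpha, beta = update(alpha, beta, cured)
    return alpha, beta

# A hypothetical ordering of the 100 outcomes (30 failures, then 70 cures).
outcomes = [False] * 30 + [True] * 70

one_at_a_time = posterior_after(outcomes)
all_at_once = (1 + sum(outcomes), 1 + len(outcomes) - sum(outcomes))
print(one_at_a_time, all_at_once)
```

Any other ordering of the same 70 cures and 30 failures ends at the same Beta parameters, so two researchers reporting the same outcomes leave the reader with the same posterior.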

Replies from: Ian Televan

## ↑ comment by Ian Televan · 2021-03-21T21:45:53.855Z · LW(p) · GW(p)

Update: a) is just wrong and b) is right, but unsatisfying because it doesn't address the underlying intuition which says that the stopping criterion ought to matter. I'm very glad that I decided to investigate this issue in full detail and run my own simulations instead of just accepting some general principle from either side.

MacKay presents it as a conflict between frequentism and Bayesianism and argues that frequentism is wrong. But I started out with a Bayesian model and still felt that motivated stopping would have some influence. I'm going to try to articulate the best argument for why the stopping criterion must matter and then explain why it fails.

First of all the scenario doesn't describe exactly what the stopping criterion was. So I made up one: The (second) researcher treats patients and gets the results one at a time. He has some particular threshold $T$ for the probability that the treatment is >60% effective and he is going to stop and report the results the moment the probability reaches the threshold. He derives this probability by calculating a beta distribution for the data and integrating it from 0.6 to 1. (for those who are unfamiliar with the beta distribution, I recommend this excellent video by 3Blue1Brown) In this case the likelihood of seeing the data ($s$ successes out of $n$ patients) given underlying probability $p$ is given by beta $f(p) = c \, p^{s} (1-p)^{n-s}$, and the probability that treatment is >60% effective is $\int_{0.6}^{1} f(p) \, dp$.

Now the argument: motivated stopping ensures that we don't just get 70 successes and 30 failures. We have an additional constraint that after each of the first 99 outcomes the probability is strictly $< T$ and only after the 100th patient it reaches $\geq T$. Surely then, we must modify $f(p)$ to reflect this constraint. And if the true probability was really >60%, then surely there are many Everett branches where the probability reaches $T$ *before* we ever get to the 100th patient. If it really took so long, then it must be because it's actually less likely that the true probability is >60%.

And indeed, the likelihood of seeing 70 successes and 30 failures with such a stopping criterion is less than is initially given by $f(p)$. BUT! The constraint is independent of the probability $p$! It is purely about the order in which the outcomes appear. In other words, it changes the constant $c$, which originally indicated the total number of all different ways to order 70 positive and 30 negative instances. And this constant reduces the likelihood for *every* probability *equally*! It doesn't reduce it more in universes where $p < 0.6$ compared to where $p > 0.6$. This means that the *shape* of the original distribution stays the same, only the *amplitude* changes. But because we condition on seeing 70 successes and 30 failures anyway, this means that the area under the curve must be equal to 1. So we have to re-normalize $f(p)$, and it comes out as the original $f(p)$ again!

Another way to think about it is that the stopping criterion is not entangled with the actual underlying probability in a given universe. There is zero mutual information between the stopping criterion and $p$. And yes, if this was *not* the case, if for example, the researcher had decided that he would also treat one more patient after reaching the threshold and only publish the results if this patient recovered (but not mention them in the report), then it would absolutely affect the results, because a positive outcome for the patient is more likely in universes where $p > 0.6$. But then it also wouldn't be purely about his state of mind, we would have an additional data point.
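The re-normalization step can be checked numerically. In this sketch, written for a uniform prior, the curve is a constant times p^70 (1-p)^30. The "reduced" constant is an arbitrary hypothetical stand-in for the smaller number of orderings allowed under motivated stopping, not a computed count. The integration is a plain midpoint sum from the standard library, not a statistics package.

```python
from math import comb

def posterior_mass_above(threshold, c, cures=70, failures=30, steps=20_000):
    """P(p > threshold) after normalizing c * p^cures * (1-p)^failures over
    [0, 1], approximated with a midpoint Riemann sum."""
    width = 1.0 / steps
    total = above = 0.0
    for i in range(steps):
        p = (i + 0.5) * width
        value = c * p**cures * (1 - p)**failures
        total += value
        if p > threshold:
            above += value
    return above / total

all_orderings = comb(100, 70)            # constant when every ordering is allowed
fewer_orderings = all_orderings // 1000  # hypothetical smaller constant (assumption)

mass_all = posterior_mass_above(0.6, all_orderings)
mass_fewer = posterior_mass_above(0.6, fewer_orderings)
print(mass_all, mass_fewer)
```

Because the constant multiplies both the numerator and the normalizing total, it drops out, and the probability that the treatment is >60% effective is the same for either constant.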