# Frequentist Statistics are Frequently Subjective

post by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2009-12-04T20:22:21.245Z · score: 71 (72 votes) · 82 comments

Andrew Gelman recently responded to a commenter on the Yudkowsky/Gelman diavlog; the commenter complained that Bayesian statistics were *too subjective* and *lacked rigor*. I shall explain why this is unbelievably ironic, but first, the comment itself:

> However, the fundamental belief of the Bayesian interpretation, that all probabilities are subjective, is problematic -- for its lack of rigor... One of the features of frequentist statistics is the ease of testability. Consider a binomial variable, like the flip of a fair coin. I can calculate that the probability of getting seven heads in ten flips is 11.71875%... At some point a departure from the predicted value may appear, and frequentist statistics give objective confidence intervals that can precisely quantify the degree to which the coin departs from fairness...

Gelman's first response is "Bayesian probabilities don't have to be subjective." Not sure I can back him on that; probability is ignorance and ignorance is a state of mind (although indeed, some Bayesian probabilities can correspond very directly to observable frequencies in repeatable experiments).

My own response is that frequentist statistics are *far* more subjective than Bayesian likelihood ratios. Exhibit One is the notion of "statistical significance" (which is what the above comment is actually talking about, although "confidence intervals" have almost the same problem). Steven Goodman offers a nicely illustrated example: Suppose we have at hand a coin, which may be fair (the "null hypothesis") or perhaps biased in some direction. So lo and behold, I flip the coin six times, and I get the result TTTTTH. Is this result statistically significant, and if so, what is the p-value - that is, the probability of obtaining a result at least this extreme?

Well, that depends. Was I planning to *flip the coin six times,* and *count the number of tails?* Or was I planning to *flip the coin until it came up heads,* and *count the number of trials?* In the first case, the probability of getting "five tails or more" from a fair coin is 11%, while in the second case, the probability of a fair coin requiring "at least five tails before seeing one heads" is 3%.
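The two p-values can be checked with a few lines of exact arithmetic. This is a minimal sketch, not code from the original post; the function names `p_fixed_flips` and `p_flip_until_heads` are made up for illustration.

```python
from fractions import Fraction
from math import comb


def p_fixed_flips(n_flips: int, min_tails: int) -> Fraction:
    """P(at least min_tails tails in n_flips flips of a fair coin)."""
    favourable = sum(comb(n_flips, k) for k in range(min_tails, n_flips + 1))
    return Fraction(favourable, 2 ** n_flips)


def p_flip_until_heads(min_tails: int) -> Fraction:
    """P(a fair coin shows at least min_tails tails before its first heads).

    This happens exactly when the first min_tails flips are all tails.
    """
    return Fraction(1, 2 ** min_tails)


fixed = p_fixed_flips(6, 5)      # plan: flip six times, count tails
until = p_flip_until_heads(5)    # plan: flip until heads, count trials
print(fixed, float(fixed))       # 7/64 0.109375
print(until, float(until))       # 1/32 0.03125
```

Same six flips, same coin; only the declared plan differs between the two calls.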

Whereas a Bayesian looks at the experimental result and says, "I can now calculate the likelihood ratio (evidential flow) between all hypotheses under consideration. Since your state of mind doesn't affect the coin in any way - doesn't change the probability of a fair coin or biased coin producing this exact data - there's no way your private, unobservable state of mind can affect my interpretation of your experimental results."
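To see why the stopping rule drops out, here is a small sketch that computes the likelihood ratio three ways: from the exact sequence, from a binomial model (six flips planned), and from a flip-until-heads model. The alternative hypothesis of a coin with a 1/4 chance of heads is an assumption picked purely for illustration, as are the function names.

```python
from fractions import Fraction
from math import comb

HEADS, TAILS = 1, 5  # the observed result TTTTTH


def exact_sequence(p_heads: Fraction) -> Fraction:
    """Probability of the exact sequence TTTTTH."""
    return (1 - p_heads) ** TAILS * p_heads ** HEADS


def binomial_model(p_heads: Fraction) -> Fraction:
    """Probability of 1 heads in 6 flips when six flips were planned."""
    return comb(HEADS + TAILS, HEADS) * exact_sequence(p_heads)


def flip_until_heads_model(p_heads: Fraction) -> Fraction:
    """Probability that the first heads arrives on flip 6."""
    return (1 - p_heads) ** TAILS * p_heads


fair = Fraction(1, 2)
biased = Fraction(1, 4)  # hypothetical alternative, chosen for illustration

for model in (exact_sequence, binomial_model, flip_until_heads_model):
    ratio = model(biased) / model(fair)
    print(model.__name__, ratio)  # 243/64 each time
```

The binomial coefficient does not depend on the coin's bias, so it cancels in the ratio. Whatever the experimenter was planning shows up only as a constant factor shared by every hypothesis.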

If you're used to Bayesian methods, it may seem difficult to even imagine that the statistical interpretation of the evidence ought to depend on a factor - namely the experimenter's state of mind - which has no causal connection whatsoever to the experimental result. (Since Bayes says that evidence is about correlation, and no systematic correlation can appear without causal connection; evidence requires entanglement.) How can frequentists manage even in *principle* to make the evidence depend on the experimenter's state of mind?

It's a complicated story. Roughly, the trick is to make yourself artificially ignorant of the data - instead of knowing the *exact* experimental result, you pick a class of possible results which *includes* the actual experimental result, and then pretend that you were told *only* that the result was somewhere in this class. So if the actual result is TTTTTH, for example, you can pretend that this is part of the class {TTTTTH, TTTTTTH, TTTTTTTH, ...}, a class whose total probability is 3% (1/32). Or if I preferred to have this experimental result *not* be statistically significant with p < 0.05, I could just as well pretend that some helpful fellow told me only that the result was in the class {TTTTTH, HHHHHT, TTTTTTH, HHHHHHT, ...}, so that the total probability of the class would be 6%, n.s. (In frequentism this question is known as applying a "two-tailed test" or "one-tailed test".)
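The two classes can be written out explicitly and summed. A rough sketch follows: the real classes are infinite, so the code truncates them at an arbitrary run length (`MAX_RUN`, an assumption of this example), which leaves the totals a hair below their limits.

```python
from fractions import Fraction


def sequence_probability(seq: str) -> Fraction:
    """Probability of one exact flip sequence from a fair coin."""
    return Fraction(1, 2 ** len(seq))


def one_tailed_class(max_run: int) -> list[str]:
    """TTTTTH, TTTTTTH, ...: five or more tails, then the first heads."""
    return ["T" * k + "H" for k in range(5, max_run + 1)]


def two_tailed_class(max_run: int) -> list[str]:
    """The one-tailed class plus its mirror images HHHHHT, HHHHHHT, ..."""
    mirrored = ["H" * k + "T" for k in range(5, max_run + 1)]
    return one_tailed_class(max_run) + mirrored


MAX_RUN = 60  # truncation point; the classes in the text are infinite
one = sum(sequence_probability(s) for s in one_tailed_class(MAX_RUN))
two = sum(sequence_probability(s) for s in two_tailed_class(MAX_RUN))
print(float(one))  # approaches 1/32
print(float(two))  # approaches 1/16
```

The actual result TTTTTH is a member of both classes; which class it is "really" part of is not a fact about the coin.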

The arch-Bayesian E. T. Jaynes ruled out this sort of reasoning by telling us that a Bayesian ought only to condition on events that actually happened, not events that could have happened but didn't. (This is not to be confused with the dog that doesn't bark. In this case, the dog was *in fact* silent; the silence of the dog happened in the real world, not somewhere else. We are rather being told that a Bayesian should not have to worry about alternative possible worlds in which the dog *did* bark, while estimating the evidence to take from the *real* world in which the dog did *not* bark. A Bayesian only worries about the experimental result that was, in fact, obtained; not other experimental results which could have been obtained, but weren't.)

The process of throwing away the actual experimental result, and substituting a class of possible results which contains the actual one - that is, deliberately losing some of your information - introduces a dose of *real* subjectivity. Colin Begg reports on one medical trial where the data was variously analyzed as having a significance level - that is, probability of the "experimental procedure" producing an "equally extreme result" if the null hypothesis were true - of p=0.051, p=0.001, p=0.083, p=0.28, and p=0.62. Thanks, but I think I'll stick with the conditional probability of the *actual* experiment producing the *actual* data.

Frequentists are apparently afraid of the possibility that "subjectivity" - that thing they were accusing Bayesians of - could allow some unspecified terrifying abuse of the scientific process. Do I need to point out the general implications of being allowed to throw away your actual experimental results and substitute a class you made up? In general, if this sort of thing is allowed, I can flip a coin, get 37 heads and 63 tails, and decide that it's part of a *class* which includes all mixtures with at least 75 heads *plus* this exact particular sequence. As if I only had the output of a fixed computer program which was written in advance to look at the coinflips and compute a yes-or-no answer as to whether the data is in that class.

Meanwhile, Bayesians are accused of being "too subjective" because we might - gasp! - assign the wrong prior probability to something. First of all, it's obvious from a Bayesian perspective that science papers should be in the business of reporting likelihood ratios, not posterior probabilities - likelihoods add up across experiments, so to get the latest posterior you wouldn't *just* need a "subjective" prior, you'd also need all the *cumulative* evidence from other science papers. Now, this accumulation might be a *lot* more straightforward for a Bayesian than a frequentist, but it's not the sort of thing a typical science paper should have to do. Science papers should report the likelihood ratios for any popular hypotheses - but above all, make the actual raw data available, so the likelihoods can be computed for *any* hypothesis. (In modern times there is absolutely no excuse for not publishing the raw data, but that's another story.)
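As a sketch of how that accumulation might go: for independent experiments, the likelihood ratios multiply, which is the same as saying their logarithms add. The three coin-flip "papers" and the 60%-heads alternative below are invented for illustration, not data from any real study.

```python
import math


def likelihood_ratio(heads: int, tails: int, p_alt: float, p_null: float = 0.5) -> float:
    """Likelihood of the data under p_alt divided by its likelihood under p_null."""
    alt = p_alt ** heads * (1 - p_alt) ** tails
    null = p_null ** heads * (1 - p_null) ** tails
    return alt / null


# Hypothetical (heads, tails) counts reported by three separate papers.
experiments = [(7, 3), (12, 8), (55, 45)]
P_ALT = 0.6  # hypothetical alternative: the coin lands heads 60% of the time

ratios = [likelihood_ratio(h, t, P_ALT) for h, t in experiments]
combined = math.prod(ratios)                    # ratios multiply...
log_evidence = sum(math.log(r) for r in ratios)  # ...so log-ratios add

# Pooling the raw data first gives the same answer.
total_heads = sum(h for h, _ in experiments)
total_tails = sum(t for _, t in experiments)
pooled = likelihood_ratio(total_heads, total_tails, P_ALT)

print(combined, pooled, math.exp(log_evidence))
```

A reader with a different alternative in mind only needs the raw counts to redo the calculation, which is the argument for publishing the data rather than one chosen summary.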

And Bayesian likelihoods *really are* objective - so long as you use the actual exact experimental data, rather than substituting something else.

Meanwhile, over in frequentist-land... what if you told everyone that you had done 127 trials because that was how much data you could afford to collect, but *really* you kept performing more trials until you got a p-value that you liked, and then stopped? Unless I've got a bug in my test program, a limit of up to 500 trials of a "fair coin" would, 30% of the time, arrive on some step where you could stop and reject the null hypothesis with p<0.05. Or 9% of the time with p<0.01. Of course this requires some degree of scientific dishonesty... or, perhaps, some minor confusion on the scientist's part... since if this is what you are *thinking,* you're supposed to use a *different* test of "statistical significance". But it's not like we can actually look inside their heads to find out what the experimenters were thinking. If we're worried about scientific dishonesty, surely we should worry about *that?* (A similar test program done the Bayesian way, set to stop as soon as finding likelihood ratios of 20/1 and 100/1 relative to an alternative hypothesis that the coin was 55% biased, produced false positives of 3.2% and 0.3% respectively. Unless there was a bug; I didn't spend that much time writing it.)
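The original test program is not shown, so the following is only a rough reconstruction under stated assumptions: the frequentist stopper uses a two-sided normal-approximation z-test after every flip (starting at flip 10), and the Bayesian stopper compares "55% heads" against "fair". Those choices, the function names, the run count, and the seed are all mine, so the exact percentages will not match the ones quoted above.

```python
import math
import random


def rejects_with_peeking(rng, max_flips, z_crit, min_flips=10):
    """Flip a fair coin, test after every flip, stop at the first 'significant' z."""
    heads = 0
    for n in range(1, max_flips + 1):
        heads += rng.random() < 0.5
        if n >= min_flips:
            # Two-sided z-statistic for the null hypothesis of a fair coin.
            z = abs(heads - n / 2) / (math.sqrt(n) / 2)
            if z > z_crit:
                return True
    return False


def stops_on_likelihood_ratio(rng, max_flips, threshold, p_alt=0.55):
    """Flip a fair coin, stop once LR(p_alt : fair) reaches the threshold."""
    log_lr = 0.0
    step_heads = math.log(p_alt / 0.5)
    step_tails = math.log((1 - p_alt) / 0.5)
    for _ in range(max_flips):
        log_lr += step_heads if rng.random() < 0.5 else step_tails
        if log_lr >= math.log(threshold):
            return True
    return False


def false_positive_rate(stopper, runs, seed, **kwargs):
    """Fraction of simulated fair-coin experiments in which the stopper fires."""
    rng = random.Random(seed)
    return sum(stopper(rng, **kwargs) for _ in range(runs)) / runs


freq = false_positive_rate(rejects_with_peeking, 2000, 0, max_flips=500, z_crit=1.96)
bayes = false_positive_rate(stops_on_likelihood_ratio, 2000, 0, max_flips=500, threshold=20.0)
print(f"peeking z-test, nominal p<0.05: {freq:.3f}")
print(f"likelihood ratio >= 20:         {bayes:.3f}")
```

The asymmetry is not an accident of the simulation. Under the null hypothesis the likelihood ratio has expected value 1 at every step, so the chance of it ever reaching 20 is at most 1/20 no matter how long you keep flipping. The nominal 5% of the z-test carries no such guarantee once you are allowed to peek.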

The *actual* subjectivity of standard frequentist methods, the ability to manipulate "statistical significance" by choosing different tests, is not a minor problem in science. There are ongoing scandals in medicine and neuroscience from lots of "statistically significant" results failing to replicate. I would point a finger, not just at publication bias, but at scientists armed with powerful statistics packages with lots of complicated tests to run on their data. Complication is really dangerous in science - unfortunately, it looks like instead we have the social rule that throwing around big fancy statistical equations is highly prestigious. (I suspect that some of the opposition to Bayesianism comes from the fact that Bayesianism is too simple.) The obvious fix is to (a) require raw data to be published; (b) require journals to *accept papers* before the *experiment* is performed, with the advance paper including a specification of what statistics were selected in advance to be run on the results; (c) raise the standard "significance" level to p<0.0001; and (d) junk all the damned overcomplicated status-seeking impressive nonsense of classical statistics and go to simple understandable Bayesian likelihoods.

Oh, and this frequentist business of "confidence intervals"? Just as subjective as "statistical significance". Let's say I've got a measuring device which returns the true value plus Gaussian noise. If I know you're about to collect 100 results, I can write a computer program such that, before the experiment is run, it's 90% probable that the true value will lie within the interval output by the program.

So I write one program, my friend writes another program, and my enemy writes a third program, all of which make this same guarantee. And in all three cases, the guarantee is true - the program's interval will indeed contain the true value at least 90% of the time, if the experiment returns the true value plus Gaussian noise.

So you run the experiment and feed in the data; and the "confidence intervals" returned are [0.9-1.5], [2.0-2.2], and ["Cheesecake"-"Cheddar"].

The problem may be made clearer by considering the third program, which works as follows: 95% of the time, it does standard frequentist statistics to return an interval which will contain the true value 95% of the time, and on the other 5% of the time, it returns the interval ["Cheesecake"-"Cheddar"]. It is left as an exercise to the reader to show that this program will output an interval containing the true value at least 90% of the time.
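For readers who would rather simulate than do the exercise, here is a minimal sketch of the third program. It assumes the noise standard deviation is known (so the "standard frequentist statistics" step is a plain z-interval); the true value, sample size, trial count, and seed are arbitrary choices for the demonstration.

```python
import math
import random
import statistics


def standard_interval(sample, sigma, z=1.96):
    """Usual 95% interval for the mean of Gaussian data with known sigma."""
    mean = statistics.fmean(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return (mean - half_width, mean + half_width)


def third_program(sample, sigma, rng):
    """95% of the time a real interval, 5% of the time dairy products."""
    if rng.random() < 0.05:
        return ("Cheesecake", "Cheddar")
    return standard_interval(sample, sigma)


def covers(interval, true_value):
    """True if the interval is numeric and contains the true value."""
    low, high = interval
    if isinstance(low, str):
        return False
    return low <= true_value <= high


rng = random.Random(1)
TRUE_VALUE, SIGMA, N, TRIALS = 1.2, 1.0, 100, 5000  # assumed for the demo

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_VALUE, SIGMA) for _ in range(N)]
    hits += covers(third_program(sample, SIGMA, rng), TRUE_VALUE)

coverage = hits / TRIALS
print(coverage)  # should land near 0.95 * 0.95 = 0.9025
```

The guarantee is a statement about the procedure averaged over all runs, which is why it survives even though one run in twenty returns an interval that is obviously useless.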

BTW, I'm pretty sure I recall reading that "90% confidence intervals" as published in journal papers, in those cases where a true value was later pinned down more precisely, did *not* contain the true value 90% of the time. So what's the point, even? Just show us the raw data and maybe give us a summary of some likelihoods.

Parapsychology, the control group for science, would seem to be a thriving field with "statistically significant" results aplenty. Oh, sure, the effect sizes are minor. Sure, the effect sizes get even smaller (though still "statistically significant") as they collect more data. Sure, if you find that people can telekinetically influence the future, a similar experimental protocol is likely to produce equally good results for telekinetically influencing the past. Of which I am less tempted to say, "How amazing! The power of the mind is not bound by time or causality!" and more inclined to say, "Bad statistics are time-symmetrical." But here's the thing: Parapsychologists are constantly protesting that they are playing by all the standard scientific rules, and yet their results are being ignored - that they are unfairly being held to higher standards than everyone else. I'm willing to believe that. It just means that the *standard* statistical methods of science are so weak and flawed as to permit a field of study to sustain itself in the complete absence of any subject matter. With two-thirds of medical studies in prestigious journals failing to replicate, getting rid of the entire actual subject matter would shrink the field by only 33%. We have to raise the bar high enough to exclude the results claimed by parapsychology under classical frequentist statistics, and then fairly and evenhandedly apply the *same* bar to the rest of science.

Michael Vassar has a theory that when an academic field encounters advanced statistical methods, it becomes really productive for ten years and then bogs down because the practitioners have learned how to game the rules.

For so long as we do not have infinite computing power, there may yet be a place in science for non-Bayesian statistics. The Netflix Prize was not won by using strictly purely Bayesian methods, updating proper priors to proper posteriors. In that acid test of statistical discernment, what worked *best* was a gigantic ad-hoc mixture of methods. It may be that if you want to get the most mileage out of your data, in this world where we do not have infinite computing power, you'll have to use some ad-hoc tools from the statistical toolbox - tools that throw away some of the data, that make themselves artificially ignorant, that take all sorts of steps that can't be justified in the general case and that are potentially subject to abuse and that will give wrong answers now and then.

But don't do that, and *then* turn around and tell me that - of all things! - Bayesian probability theory is *too subjective.* Probability theory is the *math* in which the results are *theorems* and every theorem is compatible with every other theorem and you never get different answers by calculating the same quantity in different ways. To resort to the ad-hoc variable-infested complications of frequentism while preaching your *objectivity?* I can only compare this with the politicians who go around preaching "Family values!" and then get caught soliciting sex in restrooms. So long as you deliver loud sermons and make a big fuss about painting yourself with the right labels, you get identified with that flag - no one bothers to look very hard at what you *do.* The case of frequentists calling Bayesians "too subjective" is worth dwelling on for that aspect alone - emphasizing how important it is to look at what's actually going on instead of just listening to the slogans, and how rare it is for anyone to even glance in that direction.

## 82 comments


I really like the idea of parapsychology as the control group for science; it deserves to be better known.

Yeah, that was a pretty clever turn of phrase.

According to A New Challenge to Einstein, General Relativity has been refuted at 98% confidence.

I wonder if it wouldn’t be more accurate to say that, actually, 98% confidence has been refuted at General Relativity.

This book has done a lot to help astronomy out, but much of the field is still lost in bad statistics. I ran some papers by a statistician and he agreed with me: the authors had no idea what they were doing, and neither did their referees.

But it's getting better.

And while on the subject of confidence values and misuse of statistics in science, this should prove an interesting read: http://golem.ph.utexas.edu/category/2010/09/fetishizing_pvalues.html

http://www.newscientist.com/article/mg20427375.700 'nuff said

I know you said "'nuff said", but could you elaborate on this?

The only reason to do the "don't use all available data when formulating your hypothesis, so you have additional data to test it on" thing suggested in the article, is that you're sufficiently irrational that more data can hurt you. And of course, this **does** happen in science; the most obvious failure case is probably overfitting.

If you look at the data rationally (or even just with the best approximation you can get for a certain amount of computing resources), there is nothing at all you can do when getting it in small pieces that you couldn't also do if you got it all at once.

If getting the data all at once hurts you in any way, then you're either approaching the problem wholly wrongly, or your priors are broken (for instance because they don't give sufficiently greater probability mass to simpler hypotheses than more complicated ones).

> If getting the data all at once hurts you in any way, then you're either approaching the problem wholly wrongly, or your priors are broken

Well, it should be noted that, if a theory based on a subset of the data predicts the whole data set, then that theory has a higher probability of being correct than a theory based on the whole data set.

But of course it's harder to construct such an effective theory from a subset of the data. It might be so much harder that the higher probability that such a theory would have isn't worth the effort.

> Well, it should be noted that, if a theory based on a subset of the data predicts the whole data set, then that theory has a higher probability of being correct than a theory based on the whole data set.

But that's exactly because you don't trust the scientist who came up with the hypothesis looking at the whole data set to discount correctly for the complexity of their hypothesis. This might happen either because you think they're irrational, or because you're worried about intellectual dishonesty - though in the latter case you should also worry about the scientist with the allegedly limited-data theory having snuck a peek at the full set, or just having come up with enough overly specific theories that one of them was likely to survive the follow-up test.

As the comments to that post say, if you can actually look at the hypotheses in question, and you're completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.

The idea behind the scientific method is to design procedures that are robust to the scientist being biased or incompetent or even corrupt. Any approach that starts with "assume a perfect scientist" is not going to work in reality.

Science is a set of hacks to get usable modelling out of humans, accepting that

- There are things that humans do which are critical to modelling reality, and which you do not understand to the point of being able to reimplement them, but
- you also can't just leave humans to do free-form theorizing, because that has been conclusively shown to lead to all kinds of problems.

The critical black box in this specific case is about how to judge a theory's simplicity, and what the best way to build a prior from that is. As long as either of these things is a black box to you, you won't be able to do much better than using high-level heuristic hacks of the sort science is made out of. But that's going to bite you every time you don't have the luxury of being able to apply these hacks - say because you're modelling (some aspect of) human history, and can't rerun the experiment. Also, you won't be able to build an AGI.

In addition, if you're really worried about corruption, the holding-back-data-on-purpose thing is setting up great profits to be made this way:

1. Corrupt scientist takes out a loan for BIGNUM $.
2. Corrupt scientist pays this money to someone with access to the still-secret data.
3. Bribed data keeper gives corrupt scientist a copy of the data.
4. Corrupt scientist fits their hypothesis to the whole data set.
5. Corrupt scientist publishes hypothesis.
6. Full data set is released officially.
7. Hypothesis of corrupt scientist is verified to match whole data set. Corrupt scientist gains great prestige, and uses that to obtain sufficient money to pay off the loan from step 1, and then some.

You could try to set up the data keeper organization so that a premature limited data release is unlikely even in the face of potentially large bribes, but that seems like a fairly tough problem (and are they even thinking about it seriously?). Data is very easy to copy; preventing it from being copied is hard. And in this case, more so than in most cases where you're worried about leaks, figuring out that a leak has in fact happened might be extremely difficult - at least if you really are ignorant about what hypothesis simplicity looks like.

> But that's going to bite you every time you don't have the luxury of being able to apply these hacks - say because you're modelling (some aspect of) human history, and can't rerun the experiment.

? History sounds like exactly the situation where "hold back half the data, hypothesise on the other half, then look at the whole" is the only way of reasonably going about this.

> Also, you won't be able to build an AGI.

Don't follow that argument at all - in the worst case scenario, you can brute force it by scanning and modelling a human brain. But even if true, it's not really an issue for social scientists and their ilk. And there the "look at half the data" would cause definite improvements in their procedures. It would make science work for the "flawed but honest" crowd.

As for deliberately holding back half the data from other scientists (as opposed to one guy simply choosing to only look at half), that's a different issue. I've got no really strong feelings on that. It could go either way.

It's an ok hack for someone in the "flawed but honest" crowd, individually. But note that it really doesn't scale to allowing you to deal with corruption (which was one of the problems I assumed in the post you replied to).

Extended to an entire field, this means that you may end up with N papers, all about the same data set, all proposing a different hypothesis that produces a good match on the set, and all of them claiming that their hypothesis was formulated using this procedure. IOW, you end up with unverifiable "trust us, we didn't cheat" claims for each of those hypotheses. Which is not a good basis for arriving at a consensus in the field.

Re AI design, assuming you actually understand what you implemented (as opposed to just blindly copying algorithms from the human brain without understanding what they do), the reason this method would work is that you've successfully extracted the human built-in simplicity prior (and I don't know how good that one is exactly, but it has to be a halfway workable approximation; otherwise humans couldn't model reality at all).

> As the comments to that post say, if you can actually look at the hypotheses in question, and you're completely confident in your own judgement of simplicity, that judgement completely screens off how much data was used in formulating them.

I agree that it wouldn't matter how much data we gave the scientists *if they had fixed a method for turning data into a theory beforehand*.

And I agree that such a method should settle on the simplest theory among all candidates. It should implement Occam's razor.

*But we shouldn't expect the scientists to fix such a method before seeing the data.* Occam's razor is not enough. You first have to have a computationally feasible way to generate good candidate theories from which you choose the simplest one. And we have every reason to expect that cosmologists will eventually come up with better methods for turning cosmological data into good candidate theories. Therefore, it doesn't make sense to force the cosmologists to bind themselves to a method now. They need the freedom to discover better methods than any that they've yet found.

The requirement of "computational feasibility" means that we can expect to have several candidate methods with no *a priori* way to judge confidently that one is better than the other. We will need recourse to empirical observations to compare the methods.

In this comment of mine to the post linked above, I showed that if a method produces a theory that predicts the whole data set from a subset, then that method is probably superior to a method that uses the whole data set. The proof goes through even if we assume that each method has a step where it applies Occam's razor:

> Define a *method* to be a map that takes in a batch of evidence and returns a theory. We have two assumptions.
>
> ASSUMPTION 1: The theory produced by giving an input batch to a method will at least predict that input. That is, no matter how flawed a method of theory-construction is, it won't contradict the evidence fed into it. More precisely,
>
> p( M(B) predicts B ) = 1.
>
> [...]
>
> ASSUMPTION 2: If a method M is known to be flawed, then its theories are less likely to make correct predictions of future observations. More precisely, if B2 is not contained in B1, then
>
> p( M(B1) predicts B2 | M flawed ) < p( M(B1) predicts B2 ).
>
> Now, let B1 and B2 be two disjoint and nonempty sets of input data. [E.g., two subsets of the cosmological data whose union is the whole data set.]
>
> [...] Let
>
> P1 := p( M is flawed | M(B1) predicts B2 ),
>
> P2 := p( M is flawed | M(B1 union B2) predicts B2 ).
>
> Then P1 < P2[.]

(See the comment for a proof.)
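The linked proof is not reproduced here, but the shape of the claim can be illustrated with a direct application of Bayes' rule. This is a sketch of one reading of the two assumptions, with invented numbers (a 30% prior that the method is flawed, and 20% versus 80% prediction rates); none of these figures come from the original comment.

```python
import math


def posterior_flawed(prior_flawed, p_predicts_if_flawed, p_predicts_if_sound):
    """p(M is flawed | M's theory predicts B2), by Bayes' rule."""
    numerator = p_predicts_if_flawed * prior_flawed
    evidence = numerator + p_predicts_if_sound * (1 - prior_flawed)
    return numerator / evidence


PRIOR_FLAWED = 0.3  # hypothetical prior that the method is flawed

# Case 1: the method saw only B1 and its theory still predicted B2.
# Assumption 2: a flawed method is less likely to manage this (numbers assumed).
p1 = posterior_flawed(PRIOR_FLAWED, p_predicts_if_flawed=0.2, p_predicts_if_sound=0.8)

# Case 2: the method saw B1 union B2. By Assumption 1 its theory predicts B2
# with probability 1 whether or not the method is flawed, so nothing is learned.
p2 = posterior_flawed(PRIOR_FLAWED, p_predicts_if_flawed=1.0, p_predicts_if_sound=1.0)

print(p1, p2)  # p1 falls below the prior; p2 stays at the prior
```

On this reading, a successful out-of-sample prediction is evidence about the method, while a successful in-sample "prediction" is no evidence at all.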

> And I agree that such a method should settle on the simplest theory among all candidate theories. It should implement Occam's razor.

It's not quite that simple in practice. There's a tradeoff here, between accuracy in retrospect and theory simplicity. The two extreme pathological cases are:

You demand absolute accuracy in retrospect, i.e. P(observed data | hypothesis) = 1. This is the limit case of overfitting, and yields a GLUT, which makes no or completely useless predictions about the future.

You demand maximum simplicity. This is the limit case of underfitting, and yields a maximum-entropy distribution.

You want something in between those cases. I don't know where exactly, but you would have to figure out some way to determine that point if you were, say, building an AGI.

I can't really follow your earlier post. Specifically, I can't parse your use of "predicts", which you seem to use as a boolean value. But theories don't "predict" or "not predict" outcomes in any absolute sense, they just assign probabilities to outcomes. Please explain your use of the phrase.

> I can't really follow your earlier post. Specifically, I can't parse your use of "predicts", which you seem to use as a boolean value. But theories don't "predict" or "not predict" outcomes in any absolute sense, they just assign probabilities to outcomes. Please explain your use of the phrase.

Sorry, the earlier post was in the context of a toy problem in which predictions were boolean. I should have mentioned that. (I had made this assumption explicit in an earlier comment.)

My argument shows that, in the limiting case of boolean predictions, we should trust successful theories constructed using a subset of the data over theories constructed using all the data, even if all the theories were constructed using Occam's razor. This at least strongly suggests the same possibility in more realistic cases where the theories assign probability distributions.

Ok, I think I get your earlier post now. I think you might be overcomplicating things here.

Sure, if you're not confident what the correct simplicity prior is, you can get real evidence about which theory is likely to be stronger by observing things like their ability to correctly predict the outcome of new experiments. And to the extent that this tells you something about the *way* the originating scientist generates theories, there should even be some shifting of probability mass regarding the power of other theories produced by the same scientist. But that's quite a lot of indirection, and there are significant unknown factors that will dilute these shifts.

Attempting this is somewhat like trying to estimate the probability of a scientist being right about a famous problem in their field based on their prestige. There's a signal, but it's quite noisy.

If you know what simplicity looks like (and of course that's uncomputable, but you can always approximate) - and how much it's worth in terms of probability mass - you can make a much better guess as to which hypothesis is stronger by just looking at the actual hypotheses.

Looking at things like "how many experimental results did this hypothesis actually predict correctly" is only informative to the extent that your understanding of simplicity and its value is lacking. Note that the phrase *lacking understanding of simplicity* isn't meant to be especially disparaging; good understanding of simplicity is hard. There's a reason the scientific process includes an inelegant workaround instead.

Damn. That is some wrong-headed thinking right there. Never mind the likelihood principle -- have these people never heard of cross-validation?

I don't follow; could you explain? It seems like a decent idea to me.

The fear seems to be that people might propose theories with enough degrees of freedom that they can fine-tune it to fit the complete data very closely. But, as long as the fitting process is repeatable, i.e., no numerology, cross-validation can be applied to discover which theories are genuinely predictive and which are over-fitting.

OK; so I'm looking at http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29 and the key nugget seems to be:

> we then take an independent sample of validation data from the same population as the training data, it will generally turn out that the model does not fit the validation data as well as it fits the training data. This is called overfitting, and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical validation set when an explicit validation set is not available.

But how can we come up with a hypothetical validation set for the Universe?

Some of the lines seem to suggest exactly what those researchers seem to propose - not allowing all of the observations to go into the theory, but like the Netflix Prize, holding back some data as a test:

> One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set).
>
> In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data.
>
> As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single observation from the original sample as the validation data, and the remaining observations as the training data.
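For concreteness, the K-fold and leave-one-out procedures described in those passages might look like the sketch below. The straight-line model, the synthetic data, and all function names are assumptions made for the example; nothing here comes from the cosmology proposal under discussion.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train, validation) index lists for K-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # K roughly equal subsamples
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_mse(xs, ys, k):
    """Mean squared error on held-out folds, averaged over all points."""
    errors = []
    for train, validation in k_fold_indices(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in validation)
    return sum(errors) / len(errors)


# Hypothetical data: a noisy straight line.
rng = random.Random(42)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]

mse_5fold = cross_validated_mse(xs, ys, k=5)
mse_loocv = cross_validated_mse(xs, ys, k=len(xs))  # leave-one-out
print(mse_5fold, mse_loocv)
```

Every observation is used for both fitting and testing, just never in the same round, which is what lets a researcher holding the full data set still get an out-of-sample error estimate.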

> But how can we come up with a hypothetical validation set for the Universe?

Releasing the data in dribs and drabs doesn't address this either.

> Some of the lines seem to suggest exactly what those researchers seem to propose - not allowing all of the observations to go into the theory, but like the Netflix Prize, holding back some data as a test:

There's a difference between, on the one hand, having the data freely available and being intelligent enough to use cross-validation, and on the other, having someone paternalistically hold the data back from you.

> Releasing the data in dribs and drabs doesn't address this either.

It does *force* researchers into an ad hoc cross-validation scheme, doesn't it?

There's a difference between, on the one hand, having the data freely available and being intelligent enough to use cross-validation, and on the other, having someone paternalistically hold back the data from you.

If you start from the premise that researchers may fall into the overfitting trap, then you're already treating them adversarially. And if just one researcher overfitting a theory and so becoming irrefutable will screw everything up, then the paranoid approach to data release prevents that total cockup (at the cost of some interim inefficiencies, by hindering the responsible, good, researchers).

And if just one researcher overfitting a theory and so becoming irrefutable will screw everything up, then the paranoid approach to data release prevents that total cockup

It doesn't prevent *that* entirely reliably, either. How much time are you going to give the researchers to come up with hypotheses before you release the full set? And what do you do if someone comes up with a new hypothesis after the full release, so mindblowingly elegant and simple that it blows all of the previously published ones out of the water?

If you think that some later hypotheses based on the full set might still be accepted, then you're still vulnerable to falling into the overfitting trap after the full release. If you *don't*, then you'll be locked forever into the theories scientists came up with during the partial-release window, and no later advances in the scientific method, rationality, computing, math or even the intelligence of researchers will allow you to improve upon them.

This approach might get you some extra empirical evidence, but it will be empirical evidence about theories put together under quite limited conditions, compared to what will be available to later civilization.

That does look like a reasonable case of where it's okay to hide data, but it's mainly because of *human* shortcomings as rational agents. Since there are so few cases of new data coming in, we need to have scientists go on the record with a theory *before* they see the data, so they can't try to retroactively show how their theory is "really" consistent with the observed data.

Frequentists are apparently afraid of the possibility that "subjectivity" - that thing they were accusing Bayesians of - could allow some unspecified terrifying abuse of the scientific process. Do I need to point out the general implications of being allowed to throw away your actual experimental results and substitute a class you made up?

The demon you're describing can't be exorcised simply by switching from frequentism to Bayesianism - it torments the Bayesians as well. It is more an issue of intellectual honesty than of statistical paradigm.

A Bayesian falls into the trap by choosing a prior after the data is observed. Say you have a data set partitioned into a training set and a test set. You think the data is well described by a model class M. So you take the training set and run a learning algorithm, get some good parameters, and then use the learned model to make predictions on the test set. You fail. Hmmm, you say, back to the drawing board, and cook up a different model M'. You repeat the process with M' and voila! - accurate predictions on the test set. Time to declare victory, right? Wrong. You haven't proved anything, because you looked at the test data.

In my view, no one has really understood the true magnitude of this problem - no one has looked the demon in the eye without blinking. For example, on this page you can find a listing of the best results achieved on the MNIST handwritten digit benchmark. You can see how the results get better and better. But this improvement doesn't mean anything, because the selection is decided based on how well the algorithms do on the test data! In other words, *the machine learning community implements overfitting*.
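A toy simulation makes the size of the effect concrete. This is a sketch under made-up assumptions, not a model of the MNIST leaderboard: every "model" below guesses labels uniformly at random, so its true accuracy is exactly 50%, and the function name and parameters are hypothetical.

```python
import random

def best_of_many_guessers(n_models, n_test, seed=0):
    """Best test-set accuracy among n_models that all guess at random."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_test)]
    best = 0.0
    for _ in range(n_models):
        # Each model's true accuracy is 0.5; only its luck on this test set varies.
        correct = sum(rng.randint(0, 1) == y for y in labels)
        best = max(best, correct / n_test)
    return best

# Accuracy of a single guesser versus the winner picked by test-set score.
single = best_of_many_guessers(n_models=1, n_test=100)
selected = best_of_many_guessers(n_models=1000, n_test=100)
```

The selected score sits well above 50% even though no model is better than chance, because the test set was used to do the selecting.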

I thought the Netflix Prize did a pretty good job of handling this.

What is the source for the claim that two thirds of scientific studies that are published in prestigious medical journals fail to replicate?

I thought I'd seen that before on OB, but I've looked through, and while I've found a *lot* of stuff about being skeptical of science & medicine in particular, not that exact figure:

- http://www.overcomingbias.com/2009/07/clinical-trial-sloppiness.html
- http://www.overcomingbias.com/2009/07/popular-fields-less-accurate.html
- http://www.overcomingbias.com/2008/12/it-is-simply-no-longer-possible-to-believe.html
- http://www.overcomingbias.com/2008/11/polisci-journal.html
- http://www.overcomingbias.com/2007/08/anonymous-revie.html
- http://www.overcomingbias.com/2007/01/conclusionblind.html
- http://www.overcomingbias.com/2007/01/supping_with_th.html
- http://www.overcomingbias.com/2006/12/academic_overco.html

I also remember seeing this on OB but was having trouble searching for some reason.

It's ironic that you use a number whose source you don't know in an essay recommending that people always release all their data.

Living by one's own standards is always hard when you are a skeptic ;)

These seem relevant:

http://www.overcomingbias.com/2007/09/false-findings.html

"Why Most Published Research Findings Are False": I don't think he gives a figure of 2/3, but he does say: "It can be proven that most claimed research findings are false."

"Why Most Published Research Findings Are False"

In fact, the origin of the statistic is an earlier paper by the same author, John PA Ioannidis, Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. Journalistic coverage of the later paper quotes a description of the earlier:

“Five out of six epidemiological studies have been contradicted in a very short period of time,” says Ioannidis, “while about one out of three randomized clinical trials were also refuted.”

Note that this is not 1/3 of all randomized studies, but just the highly cited ones, the ones that have any chance of an attempt at replication (and still 25% have no published attempt). It is *not* obvious to me whether these studies will be higher or lower quality than average. ETA: actually, there is a "control group" of articles published in the same prestigious journals, but with fewer citations. They are contradicted a little less often, p<.1, and not because people aren't bothering to replicate them.

But there is still quite a difference between claiming that 1/3 or 2/3 of randomized studies are wrong.

But there is still quite a difference between claiming that 1/3 or 2/3 of randomized studies are wrong.

Yes, the 2/3 is probably an error on Eliezer's part. He doesn't say randomized, so he could be misremembering the 5/6, but the 5/6 is probably not the figure to quote and Ioannidis is a little misleading in putting it first. Only 6 of the 45 highly-cited studies were retrospective and I think that is representative of medical research in the prestigious journals. (the discussion of the control group should address this, but I don't see it)

Thanks. Since I can't find the original reference (if, indeed, it ever existed outside my imagination) these will do equally well to illustrate the point.

Would you mind publishing your test code?

Since OBLW went off Typepad, I don't have any simple way to dump files somewhere. Got a suggestion?

EDIT: Thanks Cyan.

http://wiki.lesswrong.com/wiki/User:Eliezer_Yudkowsky/CodeDump

(Note: Ugly code and if you want to actually check anything you should rewrite your own before looking at mine to minimize chance of error propagation.)

Google code is overkill, but it is one option. Another: make a descendant page of your LessWrongWiki user page and copy-n-paste it there.

git and GitHub are worth some attention for anyone who writes code. GitHub have a very simple Gist subdomain for code dumps, including syntax highlighting. Pastie and others offer similar free services.

www.4shared.com seemed to work pretty well when I used it.

It's a shame it took me so long to stumble on this post. Just wrote my own program to check your result, without looking at yours, and it seems you're right. In fact it seems to be between 30% and 40%.

Maybe the wiki system should have a special area for each article where supplementary material can be posted?

If you want medical researchers to publish all data they use in their experiments, you have to say good bye to privacy. If you have enough data about the individual patients who entered a study on AIDS drugs you will be able to data mine whether your friend Joe has been one of the patients in the study.

Are we willing to live in a world where all medical data about everyone is in the public domain? The trade offs are enormous but there isn't much broad discussion about the question in the mainstream.

And they transpose the conditional! If a sample is likely given the hypothesis, it does not necessarily follow that the hypothesis is likely given the sample. This always struck me as the most egregious failure of naive significance testing.
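A small worked example, with numbers invented purely for illustration: suppose the sample has probability 0.9 under the hypothesis and 0.5 otherwise, and the hypothesis has prior probability 0.01. Bayes' theorem then gives a posterior far below 0.9.

```python
def posterior(prior_h, p_sample_given_h, p_sample_given_not_h):
    """Pr(H | sample) from the prior and the two conditional probabilities."""
    evidence = prior_h * p_sample_given_h + (1 - prior_h) * p_sample_given_not_h
    return prior_h * p_sample_given_h / evidence

# Hypothetical inputs: Pr(sample | H) = 0.9, Pr(sample | not H) = 0.5, Pr(H) = 0.01.
p_h_given_sample = posterior(prior_h=0.01,
                             p_sample_given_h=0.9,
                             p_sample_given_not_h=0.5)
```

Here Pr(sample | H) is 0.9 while Pr(H | sample) comes out under 0.02, so the two conditionals are nowhere near interchangeable.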

Although this post is old, I really dislike the description of confidence levels. An appropriate confidence level is made out of the points with the "best" likelihoods: every point in the interval has a higher likelihood than every point outside. Therefore, an output of '["Cheesecake"-"Cheddar"]' is a clear sign of bad science. A good confidence level is just a summary of likelihood ratios. And for symmetric likelihood profiles (at least near the minimum), it gives a very good idea of the function in the interesting range.

And most problems with p-values can be reduced by choosing the study and analysis technique in advance (or with blinded data, for the analysis part). If you can flip the coin again, make a small number of hypotheses and test them. Writing down good prior probabilities after looking at coin tosses does not really work with human brains, and writing down 100000 different options (including things like "the coin gives always the repeating pattern TTTTTH") before doing coin tosses is even worse.

One aspect of the frequentist approach that I think should be mentioned is its compression of information in its results.

This is pernicious for specialists, but for non-specialists it's a boon. Rather than carting around precise numerical data for every proposed theory (numerical data that we can never remember, as the uncertainty over the 2/3 non-replication figure shows - it gets even worse if we have to remember whole distributions), you simply need to remember a binary result: significant/not significant.

(Things would be even simpler if we got rid of the 95% significance level altogether).

I'd suggest that specialists should use bayesian methods in their works, but that their summaries and press releases should be in a frequentist format.

In modern times there is absolutely no excuse for not publishing the raw data, but that's another story.

One plausible counter-argument / limitation is that researchers need incentives to collect data. Overall I agree with you. Here is sociologist Jeremy Freese on this issue: http://www.jeremyfreese.com/docs/freese-reproducibility-webdraft.pdf and http://www.jeremyfreese.com/docs/Freese%20-%20OpenSourceSocialScience%20-%20062807.pdf

See also "How to Lie with Statistics", an oldie but goodie

http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728

Colin Begg reports on one medical trial

Broken link there.

http://yudkowsky.net/rational/bayes - link error

Regarding frequentists' concerns about subjectivity in the Bayesian interpretation:

We should learn to be content with happiness instead of "true happiness", truth instead of "ultimate truth", purpose instead of "transcendental purpose", and morality instead of "objective morality". [1]

...and randomness instead of "true randomness". Mind the mind.

"In modern times there is absolutely no excuse for not publishing the raw data, but that's another story."

Nope. If, like many studies, you have data on real live humans, there are perfectly sane ethical and legal considerations which make publishing raw data a non-starter. Even publishing summaries is a problem; see the way the Wellcome Trust Case Control Studies recently hauled aggregate results off their website when it became clear that individuals' disease status could be ascertained from them.

Fair point.

There can also be difficulties with plain data size. Your average journal is not going to publish any tables of a few hundred GB-worth of data; and while you can certainly link to a place to download the files, how long are they going to sit there? It would be rather embarrassing if someone read your paper in ten years and your data server was no longer there.

The notion of changing your mind about the experimental procedure, and thereby changing the significance of the result, is a bit of a straw man. You establish the experimental procedure, then you run the test; your state of mind at the time you flipped the coin is a perfectly ordinary fact about the world, which can influence your priors in a nicely Bayesian way. Of course it's possible to cheat and lie about what your state of mind actually was, but that's not a problem of frequentist mathematics.

Publish online. Let the journals maintain redundant servers. Keep hard copies for back-up just in case. This is really simple.

And what happens in twenty years when the journal goes out of business?

Do you think that that's worse than the way things are done currently?

Bittorrent? You can publish shasums of the data sets in the paper so you know it is the data you are looking for.
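For instance, a checksum to print in the paper could be computed along these lines. This is a minimal sketch: the helper name `sha256_of_stream` is hypothetical, SHA-256 is one reasonable choice of hash rather than anything the thread specifies, and the in-memory bytes stand in for a real dataset file opened in binary mode.

```python
import hashlib
import io

def sha256_of_stream(stream, chunk_size=1 << 20):
    """Hex SHA-256 digest of a binary stream, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Stand-in for open("dataset.tar.gz", "rb"); the contents here are made up.
example_dataset = io.BytesIO(b"trial_id,outcome\n1,0\n2,1\n")
published_checksum = sha256_of_stream(example_dataset)

# A reader who later obtains a copy recomputes the digest and compares.
downloaded_copy = io.BytesIO(b"trial_id,outcome\n1,0\n2,1\n")
copy_is_authentic = sha256_of_stream(downloaded_copy) == published_checksum
```

The checksum only confirms that a copy matches what the authors used; it does nothing to keep a copy available.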

Bittorrent specializes in short-term, spiky mass downloading. It's not so hot for the long tail of years or decades. How many large torrents are alive after a few years?

This is *exactly* the problem that archive.org was set up to deal with. They've been doing an excellent job of it, and their cost-per-gigabyte-month is only going to drop as storage and bandwidth become cheaper.

Yes, they have been doing an excellent job. I've donated to them more than once because I find myself using the IA on a nigh-daily basis.

But the IA is no panacea. It can only store some categories of content reliably, and the rest is inaccessible. Nor have I seen them hold & distribute the truly enormous datasets that much research will use - the biggest files I've seen the IA offer for public download are in the single gigabytes or hundreds of megabytes range.

And what happens in twenty years when the journal goes out of business?

We have the data for 20 years longer than we would have if they never published it. And, if someone else happens to care about the subject we will probably have copies remaining somewhere.

In twenty years the price of storage will have fallen enough that some library has no problem storing a copy. Let PubMed store the copy for medical data.

Storage isn't the principal cost even now. Ask some librarians; copyright clearance, format migration, metadata maintenance, bandwidth, accessibility... all of these cost much more than mere rotating disks. (Especially since 2 TB costs ~$100 in 2010, and costs will only go down - $70 in October 2011.)

Let's see if we can address each of these issues. Assume, as a baseline, that all this data gets hosted by archive.org and that storage costs are pretty much a non-issue in The Future.

**Copyright clearance.** I confess I'm not too familiar with copyright law, but shouldn't it be possible for scientists to grant permission to host and distribute their datasets forever? That sounds like it should streamline things.

**Format migration.** This can be mitigated by using human-readable formats wherever possible. Fifty years from now people may not be able to read PDF files, but they'll definitely be able to handle simple CSV data plus a README file describing the format. And XML is easy enough to puzzle out. (I resurrected a 20-year-old boolean logic synthesis program, and the data formats had not changed in that time. Plain text is easy, and it handles a lot.)

**Metadata maintenance.** The only metadata we'll really *need* for these is a unique id that can be included in papers which wish to cite the dataset. Once you have that, hosting can be as simple as sticking a bunch of .tar.gz files in a directory and setting up a web server. I could do it in ten minutes. If you *want* more elaborate metadata, go right ahead, but remember that it's the icing on the cake.

**Bandwidth.** Getting ever cheaper as we figure out how to push more data over the same fiber.

**Accessibility.** You don't need to jump through hoops to make it accessible to the blind or something; simply having the raw data and an easy-to-use web site would be plenty accessible. Even if your web site redirected every tenth visitor to Goatse.cx, it would be better than the current situation.

We don't have to make this perfect; we just have to build a base that works, and which can interoperate with other web sites. Just publicly hosting tarballs of the datasets, each one given a permanent unique identifier which is included in any papers citing it, would be fantastic.

http://opendatasets.archive.org/dataset/herlihy-moss-htm-1993.tar.gz

See that? It's not a real URL, but I wish it were.

I confess I'm not too familiar with copyright law, but shouldn't it be possible for scientists to grant permission to host and distribute their datasets forever?

Scientists cannot even get their papers published under distributable terms or Free terms. The Open Access people have issues precisely because researchers don't want to take the time to learn about all that and work through it, and copyright law, by defaulting to all-rights-reserved, doesn't help in the least. (This is one reason why such people try to get federally supported papers mandated to be open access; defaults are incredibly important.) In many cases, they don't have permission, or it's too difficult to figure out who has permission. And publishers occasionally do nasty things like fire off DMCA takedowns in all directions. (The ACM being just the latest example.)

It should be possible. It often is. It's also possible to run a marathon backwards.

This can be mitigated by using human-readable formats wherever possible.

The standard recommendation, incidentally. But this is not a cure-all, because data requires interpretation, and the entire computing world never has and never will switch entirely to textual formats. And as long as old binary or hard-to-read data is around, the costs will be paid. The Daily WTF also furnishes evidence that even textual formats can require reverse-engineering (to say nothing of the reputation scientists have for bad coding).

The only metadata we'll really need for these is a unique id that can be included in papers which wish to cite the dataset.

I have a problem, someone says; I know, I'll use a Global Unique ID... An ID is perhaps the simplest possible solution, but it only works if you never need to do anything besides answer the question 'is this dataset I'm looking at the one whose ID I know?' You don't get search, you don't get history, or descriptions, or locations, or anything. One could be in the position of the would-be BitTorrent leech: one has the hashes and .torrent one needs, but there don't seem to be any seeds...

I didn't mean accessibility in the sense of catering to the blind (although that is an issue, and textual formats alleviate it). I meant more along the lines of community issues: it needs to be publicly online, well-known, well-used, easily searched or found, and have zero friction for use. It cannot be Citizendium; it must be Wikipedia. It cannot be like the obscure Open Access databases libraries try to maintain; it must be like ArXiv. There are scads of archive sites and libraries and whatnot; no one uses them because it's too hard to remember which one to use when. Archive services benefit heavily from network effects.

your state of mind at the time you flipped the coin is a perfectly ordinary fact about the world, which can influence your priors in a nicely Bayesian way

How do you think it influences the priors?

If you intend to flip the coin six times, then your null-hypothesis prior is "I will get 0 heads with probability 0.5^6, 1 head with probability 6*0.5^6, and so on". If you intend to flip until you get a tail, the prior is "Probability 0.5 of one flip, 0.25 of two flips", and so on.

That's the likelihood under p = 0.5, not the prior. We want to infer something about p, so the prior is a distribution on p, not on the data.

Sorry, I was confused. Let me try to rephrase. Given some prior, your state of mind before the experiment affects your prediction of the outcome probabilities, and therefore informs your evaluation of the evidence. I should perhaps have said "affects the posterior" rather than "the prior".

The exact example you've given (binomial versus negative binomial sampling distribution) is actually a *counterexample* to the above assertion. Those two distributions have the same likelihood function, so the evaluation of the evidence is the same under both scenarios. It's true that the prior predictive distributions are different, but that doesn't affect the posterior distribution of the parameter.

Really? I find that counterintuitive; could you show me the calculation?

Suppose that there are two sampling distributions that satisfy (sorry about the lousy math notation) the proportionality relationship,

Pr1(data | parameter) = k * Pr2(data | parameter)

where k may depend on the data but not on the parameter. Then the same proportionality relationship holds for the prior predictive distributions,

Pr1(data) = Integral { Pr1(data | parameter) * Pr(parameter) d(parameter) }

Pr1(data) = Integral { k * Pr2(data | parameter) * Pr(parameter) d(parameter) }

Pr1(data) = k * Integral { Pr2(data | parameter) * Pr(parameter) d(parameter) }

Pr1(data) = k * Pr2(data)

Now write out Bayes' theorem:

Pr(parameter | data) = Pr(parameter) * Pr1(data | parameter) / Pr1(data)

Pr(parameter | data) = Pr(parameter) * k * Pr2(data | parameter) / (k * Pr2(data))

Pr(parameter | data) = Pr(parameter) * Pr2(data | parameter) / Pr2(data)

So it doesn't matter whether the data were sampled according to Pr1 or Pr2. You can check that the binomial and negative binomial distributions satisfy the proportionality condition by looking them up in Wikipedia.
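The same check can be run numerically. The sketch below uses the post's TTTTTH data (six flips, one head) and computes the posterior for the coin's heads probability on a grid, once treating the design as "flip six times" and once as "flip until the first head". The uniform prior over the grid and the function names are assumptions made for illustration only.

```python
from math import comb

def binomial_likelihood(p, heads, flips):
    """Pr(heads out of a fixed number of flips | p)."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

def negative_binomial_likelihood(p, heads, flips):
    """Pr(the heads-th head arrives on flip number `flips` | p)."""
    return comb(flips - 1, heads - 1) * p**heads * (1 - p)**(flips - heads)

def grid_posterior(likelihood, heads, flips, grid):
    """Posterior over grid values of p under a uniform prior on the grid."""
    weights = [likelihood(p, heads, flips) for p in grid]
    total = sum(weights)
    return [w / total for w in weights]

grid = [i / 100 for i in range(1, 100)]  # p = 0.01, 0.02, ..., 0.99

# TTTTTH: one head in six flips.
post_binomial = grid_posterior(binomial_likelihood, 1, 6, grid)
post_negative_binomial = grid_posterior(negative_binomial_likelihood, 1, 6, grid)

# The constant k from the derivation is comb(6, 1) / comb(5, 0) = 6 here.
k = binomial_likelihood(0.3, 1, 6) / negative_binomial_likelihood(0.3, 1, 6)

max_difference = max(abs(a - b)
                     for a, b in zip(post_binomial, post_negative_binomial))
```

The two likelihoods differ only by the factor k, which cancels in the normalization, so the two posteriors agree up to floating-point rounding.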

Your argument is convincing; I sit corrected.

Although this is a rather trite comment, this site is like crack to me. I can follow links in the posts and get absolutely lost in thoughts that all have some bearing on life, and whose importance most people are generally never aware of.