How to Evaluate Data?
post by jetm · 2013-04-09T04:10:30.365Z · LW · GW · Legacy · 45 comments
What I'm trying to figure out is: how do I determine whether a source I'm looking at is telling the truth? As an example, let's take this page from MetaMed: http://www.metamed.com/vital-facts-and-statistics
At first glance, I see some obvious things I ought to consider. The page often gives raw counts of how many people die in hospitals per year, but for my purposes I ought to interpret those in light of how many hospitals there are in the US and how many patients each one treats. I also notice that, since they are trying to promote their site, they probably selected the data that would best serve that purpose.
So where do I go from here? Evaluating each source they reference seems like a waste of time. I do not think it would be wrong to trust that they are not actively lying to me. But how do I move from here to an accurate picture of general doctor competence?
45 comments
Comments sorted by top scores.
comment by Morendil · 2013-04-10T13:52:37.025Z · LW(p) · GW(p)
Overall the contents of the linked page make me want to update quite a bit away from trusting MetaMed. One more example:
One million children every year have had unnecessary CT scans, which risks exposing them to radiation levels up to those experienced by survivors of Hiroshima and Nagasaki.
Compare with this excerpt from the primary source, which presumably serves as the basis for the claim:
Most of the quantitative information that we have regarding the risks of radiation-induced cancer comes from studies of survivors of the atomic bombs dropped on Japan in 1945. Data from cohorts of these survivors are generally used as the basis for predicting radiation-related risks in a population because the cohorts are large and have been intensively studied over a period of many decades, they were not selected for disease, all age groups are covered, and a substantial subcohort of about 25,000 survivors received radiation doses similar to those of concern here — that is, less than 50 mSv.
The primary source is not claiming that "a CT scan exposes you to Hiroshima-Nagasaki survivor radiation levels". It is saying the converse - "some atomic bomb survivors received doses low enough to be comparable to CT scans". The phrase "survivors of Hiroshima and Nagasaki" is pure fear-mongering - how much radiation the typical atomic bomb survivor received is not public knowledge, so we'll tend to think in terms of worst cases (this handy chart might help). The average dose received, however, was 210 mSv according to one source I consulted; this is four times the high-end dose from a pediatric CT, with the low end around 5 mSv. The statement from MetaMed is perhaps not an outright lie but it is at least grossly misleading.
For a business which has been touted right here and by no less than Eliezer himself as providing "actual evidence-based healthcare", this is a little worrisome.
(ETA: contrary to appearances I'm not actually trying to take over this whole post's discussion area, but I happen to get easily nerd-sniped by fact-checking exercises and easily worked up when I get a sense that someone is trying to pull a fast one on me.)
Replies from: private_messaging
↑ comment by private_messaging · 2013-04-26T10:49:44.930Z · LW(p) · GW(p)
Alternative medicine is primarily a domain of crackpots and scam artists, and it is no surprise whatsoever to me that the individuals involved in this sort of thing would capitalize on fear of radiation.
The doses are highly misleading; a CT scan is typically not a whole-body exposure, and the cancer risk is proportional to the tissue-adjusted whole-body exposure, not to the individual organ doses. For a CT scan of the head, that works out to roughly 1-2 mSv. The annual dose from all sources is listed as 2.4 mSv in the US.
While there's little question (based on our understanding of cancer and radiation) that the risk continues linearly down to arbitrarily low doses, the risk from a 1-2 mSv exposure is small and is utterly dwarfed by the risks inherent in ordering medical advice from non-domain-experts over the internet. Especially considering that there are guidelines for when to do and not to do CT scans, compiled by experts who have worked on this specific issue for far longer, and considering that there are considerable risks involved in not doing a CT scan.
comment by Morendil · 2013-04-10T09:32:17.965Z · LW(p) · GW(p)
The "98,000 patients" claim is really interesting as an example of Dark Arts, aside from its having been debunked often.
It is often presented as follows: "98,000 deaths per year from medical errors (the equivalent of a jumbo jet crashing every day)".
It would be... provided every single jumbo jet flying in the US was populated by people already seriously ill or injured in the first place, rather than (as is actually the case) not only healthy but generally also wealthy passengers.
Of course you're supposed to overlook that trivial difference in the demographics of people who are in planes and those who are in hospitals, and picture hospitals killing healthy rich people by the planeload.
(This also suggests that "number of deaths" is a poor metric for making such estimates and comparisons; it would be better to compute "overall loss of expected QALYs resulting from preventable mistakes in medical care" and compare that with the aggregate loss of QALYs from other causes. Of course that's much less catchy.)
Replies from: Morendil
↑ comment by Morendil · 2014-11-16T18:01:14.699Z · LW(p) · GW(p)
Interestingly, this article offers a QALY-based economic estimate, but for some weird reason plucks a wild-ass guess as to the average number of years of life lost as a result of medical errors - ten years, with not the slightest justification. Of course this leads to a largish estimate of total impact.
This other article updates the estimates of annual deaths in the US to 400,000 with a lower bound of 210,000. This may be the result of misapplying an estimate of what fraction of adverse events are preventable - this was estimated on the overall sample (including non-fatal adverse events) but then applied to the much smaller set of fatal adverse events. Most fatal events result from surgery, which the same article notes has a much lower rate of "preventable" events, but I can't see that the total deaths estimate accounts for that.
comment by Morendil · 2013-04-09T18:12:10.952Z · LW(p) · GW(p)
I have some experience with this.
Some "facts" just set my spidey-sense tingling, and I find it usually well worth the time to check out the references in such case. In general, with the slightest doubt I will at least Google the reference and check out the abstract - this is quick and will at the least guarantee that the source does exist.
Particular things that set my spidey-sense off are:
- sensationalistic claims - any claim that is ostensibly used to shock the reader into action
- excess precision - like claims about "56% of all software projects" vs "roughly 60% in the population we sampled"
- excess confidence about hard-to-test claims, in software those tend to be "productivity" claims
- claims that are ambiguous or that would be very hard to confirm experimentally, e.g. the well-known canard about how many times a day men think about sex; basically "is this even a sane question to ask"
- hard-to-find primary sources - when you can't readily check it, a claim becomes more suspicious
- abstracts that don't contain the claim being put forward - it's more suspicious when you cite a paper for a tiny bit buried deep within
- (ETA) "phone game" claims - a citation to a citation of a citation, etc. with many indirections before reaching the primary source
Let's look at some specifics of the page you cite - at the outset we note that it's designed to be sensationalistic, a marketing brochure basically. It's up to you to factor that into your assessment of how much you trust the references.
- "As many as 98,000 people die in hospitals each year as a result of medical errors" - doesn't feel implausible, but a quick Google for US annual death rate - it turns out to be twice the death rate for suicide. This document seems to contradict the finding, I'd check out the reference
- "Doctors spend an average of only ten minutes with patients" - an average like that isn't too hard to work out from a survey and squares with personal experience, I'd take it at face value
- "By some estimates, deaths caused by medicine itself total 225,000 per year" - excess precision for a phrase as vague as "caused by medicine itself", I'd check out the reference just to know what that's supposed to mean
- "Most published research findings are false" - this is the title of the Ioannidis article, and should not be taken at face value to mean "all research in all fields of medicine", read the ref for sure
- "Up to 30% of patients who die in hospitals have important diseases or lesions that are not discovered until after death" - I'd want to know more about how they estimate this - are we talking about extrapolation from the few deaths which result in autopsy?
- "It takes an average of 17 years for scientific research to enter clinical practice" - uh, maybe; somewhat ambiguous (what's an objective criterion for "entering clinical practice"?)
- "In oncology alone, 90% of preclinical cancer studies could not be replicated." - I get into trouble almost immediately trying to check this reference, given as " Begley, C. G. (n.d.). Preclinical cancer research, 8–10.": Google gives me partial matches on the title but no exact matches, (n.d.) means "No date" which is kind of weird; this does have a publication date and even a URL
- "deaths from cancer have barely been touched" - I wouldn't be surprised, cancer is a tough bastard
- "If a primary care physician provided all recommended [...] care [...] he would need to work 21.7 hours a day" - excess precision (also a hypothetical, so not really "evidence"); check the source
Also, this is a Web page, so I get suspicious on principle that no hyperlinks are provided.
Replies from: somervta
↑ comment by somervta · 2013-04-12T04:58:31.510Z · LW(p) · GW(p)
"As many as 98,000 people die in hospitals each year as a result of medical errors" - doesn't feel implausible, but a quick Google for US annual death rate - it turns out to be twice the death rate for suicide. This document seems to contradict the finding, I'd check out the reference
This number may include more than just US data.
Replies from: Morendil
comment by Morendil · 2013-04-09T19:25:27.441Z · LW(p) · GW(p)
On the 98,000 figure this may bring some balance:
Similar to previous studies, almost a quarter (22.7%) of active-care patient deaths were rated as at least possibly preventable by optimal care, with 6.0% rated as probably or definitely preventable. Interrater reliability for these ratings was also similar to previous studies (0.34 for 2 reviewers). The reviewers' estimates of the percentage of patients who would have left the hospital alive had optimal care been provided was 6.0% (95% confidence interval [CI], 3.4%-8.6%). However, after considering 3-month prognosis and adjusting for the variability and skewness of reviewers' ratings, clinicians estimated that only 0.5% (95% CI, 0.3%-0.7%) of patients who died would have lived 3 months or more in good cognitive health if care had been optimal, representing roughly 1 patient per 10 000 admissions to the study hospitals.
(...)
Medical errors are a major concern regardless of patients' life expectancies, but our study suggests that previous interpretations of medical error statistics are probably misleading.
(...)
In an exchange about the validity of these estimates, McDonald et al argued on theoretical grounds that these statistics are likely overestimates. They were particularly concerned about the lack of consideration of the expected risk of death in the absence of the medical error. Indeed, these statistics have often been quoted without regard to cautions by the authors of the original reports, who note that physician reviewers do not believe necessarily that 100% of these deaths would be prevented if care were optimal.
(...)
As predicted on theoretical grounds, many deaths reportedly due to medical errors occur at the end of life or in critically ill patients in whom death was the most likely outcome, either during that hospitalization or in the coming months, regardless of the care received. However, this was not the only—or even the largest—source of potential overestimation. Previously, most have framed ratings of preventable deaths as a phenomenon in which a small but palpable number of deaths have clear errors that are being reliably rated as causing death. Our results suggest that this view is incorrect—that if many reviewers evaluate charts for preventable deaths, in most cases some reviewers will strongly believe that death could have been avoided by different care; however, most of the "errors" identified in implicit chart review appear to represent outlier opinions in cases in which the median reviewer believed either that an error did not occur or that it had little or no effect on the outcome.
ETA: see also the primary source for the 98,000 figure (try and find it!), and this discussion of the 98,000 figure as a Fermi estimate.
comment by Morendil · 2013-04-09T18:22:14.159Z · LW(p) · GW(p)
Related to this, I've been wondering if I should write a post based on this G+ blog, but aimed at LW readers specifically and focusing on probabilistic thinking about the claim.
To recap: we have this supposed "survey" of U.S. Defense projects, claimed to have taken place in 1995 and to have looked at $37Bn worth of software development projects. It classifies them into five categories (from "OK" to "bad" to "horrible"); it doesn't really matter what they are, so we can call them A, B, C, D, E. There's a certain allocation: A: 46%, B: 29%, C: 20%, D: 3%, E: 2%.
But we also find an earlier (1979) study, with a more credible primary source. Its five categories are labeled exactly the same, its sample size is much smaller - 9 projects for $7 million total. The allocation is nearly the same: A:47%, B: 29%, C: 19%, D: 3%, E: 2%.
The article I link in the G+ post, written by someone with a PhD, remarks on this coincidence:
Individually, these studies indicate that the success rate for U. S. Government outsourcing has historically been very low. Together they indicate that the success rate has not improved despite the introduction of new technologies and management procedures over the years.
The exercise consists in working out somewhat rigorously the probability that, given the hypothesis "there exists a fixed probability with which software projects fall into categories A,B,C,D,E" you would get, within 1%, the exact same results from a huge ($37Bn) survey as you'd have gotten from an independent and much smaller sample.
(Intuitively, this is a little like saying you've found a 9-people family whose polling results exactly predict the result of a national election with 5 candidates.)
Replies from: gwern
↑ comment by gwern · 2013-04-10T17:52:07.661Z · LW(p) · GW(p)
I've clicked through, and don't entirely understand your point. So the later figure is made up? Well, obviously that's a serious issue and like your earlier post on the diagram, interesting to read about. But if the two sets of frequencies were both, you know, real and not completely made up, and examined the same populations and classified their samples the same way, I'm not sure what's so wrong about comparing them with a chi-squared or something.
Replies from: Morendil
↑ comment by Morendil · 2013-04-10T20:08:41.260Z · LW(p) · GW(p)
Thanks for the feedback. Maybe I can better understand how what's blindingly obvious to me doesn't jump out at everyone else.
I don't know how or when to use a chi-squared test. What I did was assume - for the sake of checking my intuition - that the two sets of frequencies were indeed not made up.
To work out probabilities, you need to have some kind of model. I decided to use the simplest sampling model I could think of, where in both cases any given IT project has independently a fixed probability of turning out in one of the categories A, B, C, D, E.
The 1995 "study" has a sample size of $37Bn - this in fact turns out to match estimates of the entire DoD spend on IT projects in that year. So if these numbers are correct, then the frequencies must be precisely the probabilities for any given project to fall into the buckets A, B, C, D or E.
What I did next was work out some reasonable assumptions for the 1979 set of frequencies. It is drawn from a sample of 9 projects totaling $6.8M, so the mean project cost in the sample is $755K, and knowing a few other facts we can compute a lower bound for the standard deviation of the sample.
Given a mean, a standard deviation, and the assumption that costs are normally distributed in the population, we can approach by simulation an answer to the question "how likely is our assumption that both sets of frequencies are not made up and just happen to be within 1% of each other by chance, given the respective size of the samples".
The frequencies are given in terms of the categories as a proportion of the total cost. I wrote a Python program to repeatedly draw a sample of 9 projects from a population assumed to have the above mean cost and standard deviation, compute the relative proportions of the 5 categories, and return a true result if they were within 1% of the population probabilities.
Run this program passing the number of simulation runs as an argument. You can verify that the likelihood of reproducing the same set of frequencies within 1%, assuming that this happens by chance, is vanishingly small.
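For reference, here is a minimal sketch of the kind of simulation described above - not the original code (which was removed below), just a reconstruction from the description. The category shares are the 1995 figures quoted earlier; the mean cost is the $755K worked out above; the standard deviation is a placeholder assumption (see the SD discussion further down the thread).

import sys
import numpy as np

# 1995 category shares quoted earlier, treated as population probabilities for A..E.
POP_SHARES = np.array([0.46, 0.29, 0.20, 0.03, 0.02])
MEAN_COST = 755_000      # $6.8M / 9 projects, as computed above
SD_COST = 320_000        # placeholder assumption; see the SD discussion further down
N_PROJECTS = 9
TOLERANCE = 0.01         # "within 1%" of the population shares

def trial(rng):
    # Draw 9 project costs, floored at a small positive value to avoid negative costs.
    costs = np.maximum(rng.normal(MEAN_COST, SD_COST, N_PROJECTS), 10_000)
    # Assign each project to a category with the population probabilities.
    cats = rng.choice(len(POP_SHARES), size=N_PROJECTS, p=POP_SHARES)
    # Each category's share of the total sample cost.
    shares = np.array([costs[cats == k].sum() for k in range(len(POP_SHARES))]) / costs.sum()
    return np.all(np.abs(shares - POP_SHARES) <= TOLERANCE)

if __name__ == "__main__":
    runs = int(sys.argv[1]) if len(sys.argv) > 1 else 100_000
    rng = np.random.default_rng(0)
    hits = sum(trial(rng) for _ in range(runs))
    print(f"{hits} matches within 1% in {runs} runs ({hits / runs:.2e})")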
So, this "experiment" rejects the null hypothesis that the apparent match in both sets of frequencies is due to chance, as opposed to something else like one of them being made up.
(EDIT - removed the code I originally posted in this comment, a better version appears here.)
Replies from: gwern
↑ comment by gwern · 2013-04-10T20:45:26.970Z · LW(p) · GW(p)
I don't know how or when to use a chi-squared test. What I did was assume - for the sake of checking my intuition - that the two sets of frequencies were indeed not made up.
It's the usual go-to frequentist test for comparing two sets of categorical data. Say you have 4 categories with 10/4/9/3 members, you have your null hypothesis, and you're interested in how often, assuming the null, results as extreme as or more extreme than your new data of 200/80/150/20 would appear. Like rolling a biased 4-sided die.
(If you're curious, that specific made-up example would be chisq.test(matrix(c(10,4,9,3,200,80,150,20), ncol = 2)), with a p-value of 0.4.)
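A rough Python equivalent of that R call, for readers who prefer it (just a sketch; it assumes scipy is available):

from scipy.stats import chi2_contingency

# The same made-up example: two samples of counts over four categories.
observed = [[10, 200],
            [4,   80],
            [9,  150],
            [3,   20]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p:.2f}")  # p comes out around 0.4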
The 1995 "study" has a sample size of $37Bn - this in fact turns out to match estimates of the entire DoD spend on IT projects in that year. So if these numbers are correct, then the frequencies must be precisely the probabilities for any given project to fall into the buckets A, B, C, D or E. What I did next was work out some reasonable assumptions for the 1979 set of frequencies. It is drawn from a sample of 9 projects totaling $6.5M, so the mean project cost in the sample is $755K, and knowing a few other facts we can compute a lower bound for the standard deviation of the sample.
This seems like a really weird procedure. You should be looking at the frequencies of each of the categories, not messing around with means and standard deviations. (I mean heck, just what about 2 decades of inflation or military growth or cutbacks?) What, you think that the 1995 data implies that the Pentagon had $37bn/$755K=49006 different projects?
I don't know Python or NumPy, and your formatting is messed up, so I'm not sure what exactly you're doing. (One nice thing about using precanned routines like R's chisq.test: at least it's relatively clear what you're doing.)
But we also find an earlier (1979) study, with a more credible primary source. Its five categories are labeled exactly the same, its sample size is much smaller - 9 projects for $7 million total. The allocation is nearly the same: A:47%, B: 29%, C: 19%, D: 3%, E: 2%.
Looking closer, I'm not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number. 47% * 9 is 4.23. Also not a positive integer or zero. 0.29 * 9 is 2.61.
Sure, the percentages do sum to 100%, but D and E aren't even possible: 1/9 = 11%!
Replies from: Morendil, Morendil, Morendil
↑ comment by Morendil · 2013-04-12T17:03:59.239Z · LW(p) · GW(p)
Looking closer, I'm not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number.
Basically, that's you saying exactly what is making me say "the coincidence is implausible". A sample of 9 will generally not contain an instance of something that comes up 2% of the time. Even more seldom will it contain that and an instance of something that comes up 3% of the time.
So, in spite of appearances, it seems as if our respective intuitions agree on something. Which makes me even more curious as to which of us is having a clack and where.
Replies from: gwern
↑ comment by gwern · 2013-04-12T17:26:14.785Z · LW(p) · GW(p)
No, my point there was that in a discrete sample of 9 items, 2% simply isn't possible. You jump from 1/9 (11%) straight to 0/9 (0%). But you then explained this impossibility as being the percentage of the total budget of all sampled projects that could be classified that way, which doesn't make the percentage mean much to me.
↑ comment by Morendil · 2013-04-10T20:49:01.972Z · LW(p) · GW(p)
not sure this data makes sense. 0.02 * 9 is... 0.18. Not a whole number
The proportions are by cost, not by counts. The 2% is one $118K project, which works out to 1.7% of the $6.8M total, rounded up to 2%.
Replies from: gwern
↑ comment by gwern · 2013-04-10T20:55:09.762Z · LW(p) · GW(p)
So you don't even know how many projects are in each category for the original study?
Replies from: Morendil
↑ comment by Morendil · 2013-04-10T20:58:59.487Z · LW(p) · GW(p)
Nope, aggregates is all we get to work with, no raw data.
Replies from: gwern
↑ comment by gwern · 2013-04-10T21:26:14.343Z · LW(p) · GW(p)
Yeah, I don't think you can do anything with this sort of data. And even if you had more data, I'm not sure whether you could conclude much of anything - almost identical percentages are always going to be highly likely, even if you go from a sample of 9 to a sample of 47000 or whatever. I'll illustrate. Suppose instead of being something useless like fraction of expenditure, your 1970s datapoint was exactly 100 projects, 47 of which were classified A, 29 of which were classified B, etc. (we interpret the percentages as frequencies and don't get any awkward issues of "the average person has 1.9 arms"); and we took the mean and then estimated the $37bn datapoint as having the same mean per project, so we could indeed estimate that it was a sample of around 49,000 projects, and so the second sample was 490 times bigger (49k / 100). So when we look at A being 47% in the first sample we have n=47 projects, but when we look at A being 46% in the second sample, we this time have an n of 46*490=22540 projects. Straightforward enough, albeit an exercise in making stuff up.
So, with a sample 490 times larger, does differing by a percent or two offer any reason to reject the null that they have the same underlying distributions? No, because they're still so similar:
R> chisq.test(matrix(c(47,29,19,3,2, 46*490,29*490,20*490,3*490,2*490), ncol = 2), simulate.p.value = TRUE, B = 20000000)
Pearson's Chi-squared test with simulated p-value (based on 2e+07 replicates)
data: matrix(c(47, 29, 19, 3, 2, 46 * 490, 29 * 490, 20 * 490, 3 * 490, 2 * 490), ncol = 2)
X-squared = 0.0716, df = NA, p-value = 0.9983
Replies from: Morendil, Morendil
↑ comment by Morendil · 2013-04-12T09:35:56.885Z · LW(p) · GW(p)
Yeah, I don't think you can do anything with this sort of data.
I don't see why I should give up just because what I've got isn't convenient to work with. The data is what it is, I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.
Intuitively, the existence of categories at 2% and 3% make the conclusion clear. If the 1995 data isn't made up, then it is very rare that a project falls into one of these categories at all - respectively 1/50 and 1/30 chances. So the chance that our small sample of 9 projects happens to contain one each of these kinds of projects is very small to start with, about 9/150. Immediately this is strong Bayesian evidence against the null hypothesis.
Do you disagree?
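(A back-of-the-envelope check of that intuition, treating the 2% and 3% shares as per-project probabilities - which is only a rough approximation, since the shares are actually by cost:)

# P(a sample of 9 contains at least one project from each of the two rare categories),
# by inclusion-exclusion, assuming projects fall into them independently with p=0.02 and p=0.03.
p = 1 - 0.98**9 - 0.97**9 + 0.95**9
print(round(p, 3))  # roughly 0.04, the same order of magnitude as the 9/150 figure above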
My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to "hug the query" as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much "slop" in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)
I don't have to assume anything at all about the 1995 data (such as how many projects it represents), because as I've stated earlier $37B is the entire DoD spend in that year - if the data isn't made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies. I treat the 1995 data as "truth", and only need to view the 1979 as a sampling procedure.
Here is a corrected version of the code. I've also fixed the SD of the sample, which I miscalculated the first time around.
(My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution. Then we already know the mean ($6.8M / 9 = $755K), we know that one project cost $119K and another $198K (accounting for the 2% and 3% categories respectively), so the "generous" assumption is that the other 7 projects were all the same size ($926K), giving us the tightest normal possible.)
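(The arithmetic behind that "tightest normal" assumption, as a quick check - this takes the population SD of the 9 assumed costs, which is one way to read "lower bound":)

import math

# 9 projects totaling $6.8M, with one known $119K project and one known $198K project.
total, n = 6_800_000, 9
known = [119_000, 198_000]
mean = total / n                                  # ≈ $755K, as above
rest = (total - sum(known)) / (n - len(known))    # the other 7 assumed equal: ≈ $926K
costs = known + [rest] * (n - len(known))
sd = math.sqrt(sum((c - mean) ** 2 for c in costs) / n)
print(f"mean ≈ ${mean:,.0f}, lower-bound SD ≈ ${sd:,.0f}")  # SD comes out around $320K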
Replies from: gwern, Kindly
↑ comment by gwern · 2013-04-12T15:58:03.365Z · LW(p) · GW(p)
I don't see why I should give up just because what I've got isn't convenient to work with. The data is what it is, I want to use it in a Bayesian update of my prior probabilities that the 1995 data is kosher or made up.
Well heck, no one can stop you from intellectual masturbation. Just because it emits nothing anyone else wants to touch is not a reason to avoid doing it.
But you're working with made up data, the only real data is a high level summary which doesn't tell you what you want to know, you have no reasonably defined probability distribution, no defensible priors, and you're working towards justifying a conclusion you reached days ago (this exercise is a perfect example of motivated reasoning: "I dislike this data, and it turns out I am right since some of it was completely made up, and now I'm going to prove I'm extra-right by exhibiting some fancy statistical calculations involving a whole bunch of buried assumptions and choices which justify the already written bottom line").
My more elaborate procedure is only trying to refine this judgment by taking into account the entire joint probability distribution and trying to "hug the query" as much as possible. With the simulation I can not only pinpoint how astronomically unlikely the coincidence is, but also tell you how much "slop" in categories would be plausible. (If you look for a match within 5% rather than within 1%, then the probability of a coincidence rises to less-than-significant.)
I've already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?
I don't have to assume anything at all about the 1995 data (such as how many projects it represents), because as I've stated earlier $37B is the entire DoD spend in that year - if the data isn't made up then it amounts to an exhaustive survey rather than a sampling, and thus the observed frequencies are population frequencies...My reasoning is as follows: assume the costs of the projects are drawn from a normal distribution.
As I said. Assumptions.
Here is a corrected version of the code. I've also fixed the SD of the sample, which I miscalculated the first time around.
Although it's true that even if you make stuff up and choose to interpret things weirdly in order to justify the conclusion, the code should at least do what you wanted it to.
Replies from: Morendil
↑ comment by Morendil · 2013-04-12T16:51:55.227Z · LW(p) · GW(p)
Do you disagree that the presence in a small sample of two instances of very rare species constitutes strong prima facie evidence against the "coincidence" hypothesis?
I've already pointed out that under a reasonable interpretation of the imaginary data, the observed frequencies are literally the most likely outcome. Would your procedure make any sense if run on, say, lottery tickets?
I don't know what you mean by the above, despite doing my best to understand. My intuition is that "the most likely outcome" is one in which our 9-project sample will contain no project in either of the "very rare" categories, or at best will have a project in one of them. (If you deal me nine poker hands, I do not expect to see three-of-a-kind in two of them.)
I didn't understand your earlier example using chi-squared, which is what I take you to mean by "already pointed out". You made up some data, and "proved" that chi-squared failed to reject the null when you asked it about the made-up data. You assumed a sample size of 100, when the implausibility of the coincidence hypothesis comes precisely from the much smaller sample size (plus the existence of "rare" categories and the overall number of categories).
a perfect example of motivated reasoning
I'm experiencing it as the opposite - I already have plenty of reasons to conclude that the 1995 data set doesn't exist, I'm trying to give it the maximum benefit of doubt by assuming that it does exist and evaluating its fit with the 1979 data purely on probabilistic merits.
(ETA: what I'm saying is, forget the simulation, on which I'm willing to cop to charges of "intellectual masturbation". Instead, focus on the basic intuition. If I'm wrong about that, then I'm wrong enough that I'm looking forward to having learned something important.)
(ETA2: the fine print on the chi-square test reads "for the chi-square approximation to be valid, the expected frequency should be at least 5" - so in this case the test may not apply.)
Replies from: gwern
↑ comment by gwern · 2013-04-12T18:17:15.167Z · LW(p) · GW(p)
Do you disagree that the presence in a small sample of two instances of very rare species constitutes strong prima facie evidence against the "coincidence" hypothesis?
Why is coincidence a live hypothesis here? Surely we might expect there to be some connection - the numbers are ostensibly about the same government in the same country in different time periods. Another example of what I mean by you are making a ton of assumptions and you have not defined what parameters or distributions or sets of models you are working with. This is simply not a well-defined problem so far.
I didn't understand your earlier example using chi-squared, which is what I take you to mean by "already pointed out". You made up some data, and "proved" that chi-squared failed to reject the null when you asked it about the made-up data. You assumed a sample size of 100, when the implausibility of the coincidence hypothesis comes precisely from the much smaller sample size (plus the existence of "rare" categories and the overall number of categories).
And as I mentioned, I could do no other because the percentages simply cannot work as frequencies appropriate for any discrete tests with a specific sample of 9. I had to inflate to a sample size of 100 so I could interpret something like 2% as meaning anything at all.
Replies from: Morendil
↑ comment by Morendil · 2013-04-12T18:25:42.425Z · LW(p) · GW(p)
Why is coincidence a live hypothesis here?
What I mean by "coincidence" is "the 1979 data was obtained by picking at random from the same kind of population as the 1995 data, and the close fit of numbers results from nothing more sinister than a honest sampling procedure".
You still haven't answered a direct question I've asked three times - I wish you would shit or get off the pot.
(ETA: the 1979 document actually says that the selection wasn't random: "We identified and analyzed nine cases where software development was contracted for with Federal funds. Some were brought to our attention because they were problem cases." - so that sample would have been biased toward projects that turned out "bad". But this is one of the complications I'm choosing to ignore, because it weighs on the side where my priors already lie - that the 1995 frequencies can't possibly match the 1979 frequencies that closely without the latter being a textual copy of the earlier. I'm trying to be careful that all the assumptions I make, when I find I have to make them, work against the conclusion I suspect is true.)
Replies from: gwern
↑ comment by gwern · 2013-04-12T18:58:06.788Z · LW(p) · GW(p)
What I mean by "coincidence" is "the 1979 data was obtained by picking at random from the same kind of population as the 1995 data,
What population is that?
You still haven't answered a direct question I've asked three times - I wish you would shit or get off the pot.
You are not asking meaningful questions, you are not setting up your assumptions clearly. You are asking me, directly, "Is bleen more furfle than blaz, if we assume that quux>baz with a standard deviation of approximately quark and also I haven't mentioned other assumptions I have made?" Well, I can answer that quite easily: I have no fucking idea, but good luck finding an answer.
While we are complaining about not answering, you have not answered my questions about coin flipping or about lotteries.
Replies from: Morendil, Morendil
↑ comment by Morendil · 2013-04-12T20:59:59.108Z · LW(p) · GW(p)
you have not answered my questions about coin flipping or about lotteries.
(You didn't ask a question about coin flipping. The one about lotteries I answered: "I don't know what you mean". Just tying up any loose ends that might be interpreted as logical rudeness.)
↑ comment by Morendil · 2013-04-12T19:14:53.584Z · LW(p) · GW(p)
What population is that?
Answered already - if the 1995 data set exists, then it pretty much has to be a survey of the entire spend of the US Department of Defense on software projects; a census, if you will. (Whether that is plausible or not is a separate question.)
You are not asking meaningful questions
Okay, let me try another one then. Suppose we entered this one into PredictionBook: "At some point before 2020, someone will turn up evidence such as a full-text paper, indicating that the 1995 Jarzombek data set exists, was collected independently of the 1979 GAO data set, and independently found the same frequencies."
What probability would you assign to that statement?
I'm not trying to set up any assumptions, I'm just trying to assess how plausible the claim is that the 1995 data set genuinely exists, as opposed to its being a memetic copy of the 1979 study. (Independently even of whether this was fraud, plagiarism, an honest mistake, or whatever.)
Replies from: gwern
↑ comment by Kindly · 2013-04-12T18:55:25.082Z · LW(p) · GW(p)
Intuitively, the existence of categories at 2% and 3% make the conclusion clear. If the 1995 data isn't made up, then it is very rare that a project falls into one of these categories at all - respectively 1/50 and 1/30 chances. So the chance that our small sample of 9 projects happens to contain one each of these kinds of projects is very small to start with, about 9/150.
Given that we know nothing about how the projects themselves were distributed between the categories, we can't actually say this with any confidence. It's possible, for example, that the 2% category actually receives many projects on average, but they're all cheap.
If you assume that the project costs are normally distributed, then that assumption makes the 1979 data inherently unlikely, no matter how close the percentages are to 1995: the existence of a category receiving 2% of the funding means that at best you have a data point which is only 18% of the mean (and another point at 27%). That just doesn't happen for normal distributions (unless the variance is so large that the model becomes ridiculous anyway, due to the huge probability of it giving you negative numbers).
Replies from: Morendil
↑ comment by Morendil · 2013-04-13T10:54:04.599Z · LW(p) · GW(p)
It's actually quite plausible that cheaper projects have a greater chance of falling into the rare category of successful projects, as the original 1979 study defined success - "used without extensive rework". It's also quite possible that project size isn't normally distributed.
What I seem to have trouble conveying is my intuition that the fit is too close to be true - that in general if you have a multinomial distribution with five categories, and you draw a small sample from that distribution, it is quite unlikely that your sample frequencies will come within 1% of the true probabilities.
The chi-squared test, if I've understood it correctly, computes the converse probability - the probability that your sample contains frequencies that are this far removed or more from the true probabilities, given the assumption that it's drawn from a distribution with those probabilities. In the case that concerns me the chi-square is obviously very small, so that the p-value approaches unity.
What I'm saying - and it may be a crazy thing to say - is that it's precisely this small distance from the true probabilities that makes me suspicious.
Replies from: Kindly
↑ comment by Kindly · 2013-04-13T13:17:32.596Z · LW(p) · GW(p)
I realize what you're getting at, and it is suspicious, I'm just saying that the probabilities you're trying to calculate for it aren't correct.
I'm also not sure what your alternate hypotheses are. There's no way that the 1979 data was fabricated to fit the 1995 percentages, is there? So any argument that casts doubt on the 1979 data being possible to begin with is going to penalize all possible alternate hypotheses. That's the problem with the normality assumption: assuming a normal distribution with any true mean makes the 1979 data unlikely, whether or not the percentages are suspiciously close.
Replies from: Morendil, Morendil
↑ comment by Morendil · 2013-04-15T15:56:37.084Z · LW(p) · GW(p)
I've just come across a more technical explanation than usual of "The Mendel-Fisher Controversy" which frames it as having been about formalizing an intuition of data "too good to be true" using chi-squared.
It is less well known, however, that in 1936, the great British statistician and biologist R. A. Fisher analyzed Mendel’s data and found that the fit to Mendel’s theoretical expectations was too good (Fisher 1936). Using χ2 analysis, Fisher found that the probability of obtaining a fit as good as Mendel’s was only 7 in 100,000. (source)
And this PDF or this page say pretty much the same.
Incidentally a very high P (>0.9) is suspicious, as it means that the results are just too good to be true! This suggests that there is some bias in the experiment, whether deliberate or accidental.
So, ISTM, gwern's analysis here leads to the "too good to be true" conclusion.
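(As a quick check, plugging the X-squared statistic from gwern's test above into the chi-squared survival function - assuming the 4 degrees of freedom of a 5x2 table, and that scipy is available:)

from scipy.stats import chi2

p = chi2.sf(0.0716, df=4)   # X-squared quoted above; df = (5-1)*(2-1) = 4
print(round(p, 4))          # ≈ 0.999, well above the ">0.9 is suspicious" threshold quoted above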
↑ comment by Morendil · 2013-04-13T14:35:17.379Z · LW(p) · GW(p)
There's no way that the 1979 data was fabricated to fit the 1995 percentages, is there?
No, I'm quite confident the 1979 document is genuine (call it 100% minus a hair). Just what the data represents is something else again - by the authors' own admission they worked with a biased sample.
The 1995 sample, assuming it is genuine, is quite unbiased - since it is (claimed to be) the entire population.
I'm also not sure what your alternate hypotheses are.
To me it seems quite likely that the 1995 "results" are artifactual: my main theory is that someone heard an oral presentation from the person cited as the author, conflated that presentation in their mind with the 1979 data, and a few years later presented a chimera of the two, attributing it to the speaker. Later authors just copied and pasted the claim and reference, neglecting to fact-check it.
the probabilities you're trying to calculate for it aren't correct
I'm willing to accept that. But if we agree that the close fit is suspicious, then I would hazard that we have some mathematical background for that intuition, and if so there must be at least some way of formalizing that intuition which is better than saying "I just don't know".
Conversely, if that intuition is in fact ungrounded (perhaps for the same reason we call "too improbable to be a coincidence" a winning lottery draw which pattern-matches something significant to us, like a birth date), there should be a way of formalizing that.
↑ comment by Morendil · 2013-04-11T05:23:28.651Z · LW(p) · GW(p)
So you wouldn't be surprised by my hypothetical scenario, where a family of 9 is claimed to poll exactly the same as the results in a national election?
Replies from: gwern
↑ comment by gwern · 2013-04-11T16:40:13.504Z · LW(p) · GW(p)
No, I would be surprised, but that is due to my background knowledge that a family unit implies all sorts of mutual correlations, ranging from growing up (if one's parents are Republicans, one is almost surely a Republican as well) to location (most states are not equally split ideologically), and worries about biases and manipulations and selection effects ("This Iowa district voted for the winning candidate in the last 7 elections!").
On the other hand, if you simply told me that 9 random people split 5-4 for Obama, I would simply shrug and say, "Well, yeah. Obama had the majority, and in a sample of 9 people, a 5-4 split for him is literally the single most likely outcome possible - every other split like 9-0 is further removed from the true underlying probability that ~52% of people voted for him. It's not all that likely, but you could say that about every lottery winner or every single sequence you get when flipping a fair coin n times: each possible winner had just a one in millions chance of winning, or each sequence had a 0.5^n chance of happening. But, something had to happen, someone had to win the lottery, some sequence had to be produced by the final coin flip."
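(A quick check of that claim, taking the ~52% share quoted above at face value and assuming scipy is available:)

from scipy.stats import binom

# P(exactly k of 9 random voters chose Obama), with p = 0.52 as quoted above.
pmf = {k: binom.pmf(k, 9, 0.52) for k in range(10)}
print(max(pmf, key=pmf.get), round(max(pmf.values()), 3))  # 5 is the single most likely split, at ~0.25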
comment by [deleted] · 2013-04-09T05:32:06.387Z · LW(p) · GW(p)
It is more useful to determine whether a source you're looking at is not telling the truth. Find one black swan and you don't have to look at all possible swans to determine that the claim "all swans are white" is not correct.
In the example you gave, identify ways to determine whether the source is not telling the truth. That could include inaccurate quotes, or accurate quotes of faulty data, or consulting the cited texts plus competing texts, but I'm not sure it can include avoiding reading the cited texts even if you think it's a waste of time.
Replies from: jetm
comment by DaFranker · 2013-04-09T13:36:09.107Z · LW(p) · GW(p)
As Trevor_Blake said, there's very little you can do apart from actually checking some of the data. An alternative is to ask or pay someone else or a group to verify it for you.
Of course, there's always the option of coding a probabilistic engine that mines for stats and gives you reliability estimates of certain claims using some bayes-fu. But that takes math, programming, and lots of work.
Replies from: VCavallo
↑ comment by VCavallo · 2013-04-10T15:30:05.560Z · LW(p) · GW(p)
that takes math, programming, and lots of work
But sounds totally awesome. Especially if it can be created once and used over and over for different applications.
Replies from: DaFranker
↑ comment by DaFranker · 2013-04-10T16:11:37.298Z · LW(p) · GW(p)
Well, my naive first thought was to abuse the opencyc engine for a while so it starts getting good rough guesses of which particular mathematical concepts and quantities and sets are being referred to in a given sentence, and plug it either directly or by mass download and conversion into various data sources like WolframAlpha or international health / crime / population / economics databases or various government services.
But that still means doing math (doing math with linguistics), tons and tons of programming to even get a working prototype that understands "30% of americans are older than 30 years old", way more work than I care to visualize just to get the system to not explode and to respond in a sane manner when you throw something incongruent at it ("30 of americans are 30% years old" should not make the system choke, for example), etc. And then you've got to build something usable around that, interfaces, ways to extract and store data, and then probably pack everything together. And once you're there, you probably want to turn it into a product and sell it, since you might as well cash in some money on all of this work. Then more work.
The whole prospect looks like a small asteroid rather than a mountain, from where I'm sitting. I am not in the business of climbing, mining, deconstructing and exporting small asteroids. I'll stick to climbing over mountains until I have a working asteroid-to-computronium converter.
Replies from: gwern
↑ comment by gwern · 2013-04-10T17:40:16.149Z · LW(p) · GW(p)
My suggestion would be to go via some sort of meta-analysis or meta-meta-analysis (yes, that's a thing); if you have, for example, a meta-analysis of all results in a particular field and how often they replicate, you can infer pretty accurately how well a new result in that field will replicate. (An example use: 'So 90% of all the previous results with this sample size or smaller failed to replicate? Welp, time to ignore this new result until it does replicate.')
It would of course be a ton of work to compile them all, and then any new result you were interested in, you'd still have to know how to code it up in terms of sample size, which sub-sub-field it was in, what the quantitative measures were etc, but at least it doesn't require nigh-magical AI or NLP - just a great deal of human effort.
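A minimal sketch of that base-rate idea, with entirely made-up tallies just to show the shape of the calculation (group past results in the field by sample-size bucket, then estimate a new result's replication probability from its bucket, here with a flat Beta(1,1) prior):

# Hypothetical replication tallies for one field, grouped by sample-size bucket.
past = {
    "n < 50":   {"replicated": 3,  "failed": 27},
    "50 - 500": {"replicated": 18, "failed": 22},
    "n > 500":  {"replicated": 30, "failed": 10},
}

def p_replicates(bucket, prior_a=1, prior_b=1):
    # Posterior mean of the bucket's replication rate under a Beta(prior_a, prior_b) prior.
    r, f = past[bucket]["replicated"], past[bucket]["failed"]
    return (r + prior_a) / (r + f + prior_a + prior_b)

for bucket in past:
    print(bucket, round(p_replicates(bucket), 2))  # e.g. a new small-sample result: ~0.12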
Replies from: DaFranker
↑ comment by DaFranker · 2013-04-10T18:52:26.507Z · LW(p) · GW(p)
Nigh-magical is the word indeed. I just realized that if my insane idea in the grandparent were made to work, it could be unleashed upon all research publications ever everywhere for mining data, figures, estimates, etc., and then output a giant belief network of "this is collective-human-science's current best guess for fact / figure / value / statistic X".
That does not sound like something that could be achieved by a developer less than google-sized. It also fails all of my incredulity and sanity checks.
(it also sounds like an awesome startup idea, whatever that means)
Replies from: gwern
↑ comment by gwern · 2013-04-10T19:14:50.719Z · LW(p) · GW(p)
Or IBM-sized. But if you confined your ambitions to analyzing just meta-analyses, it would be much more doable. The narrower the domain, the better AI/NLP works, remember. There's some remarkable examples of what you can do in machine-reading a narrow domain and extracting meaningful scientific data; one of them is ChemicalTagger (demo), reading chemistry papers describing synthesis processes and extracting the process (although it has serious problems getting papers to use). I bet you could get a lot out of reading meta-analyses - there's a good summary just in the forest plot used in almost every meta-analysis.
comment by ChristianKl · 2013-04-09T17:16:46.040Z · LW(p) · GW(p)
If you have a specific claim where you aren't sure whether it's true, http://skeptics.stackexchange.com/ is a good website. Start by checking whether someone else has already asked the question. If nobody has, then you can open a new question.