Error detection bias in research

neq1

post by neq1 · 2010-09-22T03:00:33.555Z · LW · GW · Legacy · 37 comments

I have had the following situation happen several times during my research career: I write code to analyze data; there is some expectation about what the results will be; after running the program, the results are not what was expected; I go back and carefully check the code to make sure there are no errors; sometimes I find an error.

No matter how careful you are when it comes to writing computer code, I think you are more likely to find a mistake if you think there is one. Unexpected results lead one to suspect a coding error more than expected results do.

Researchers usually have general expectations about what they will find (e.g., the drug will not increase risk of the disease; the toxin will not decrease risk of cancer).

Consider the following graphic:

Here, the green region is consistent with what our expectations are. For example, if we expect a relative risk (RR) of about 1.5, we might not be too surprised if the estimated RR is between (e.g.) 0.9 and 2.0. Anything above 2.0 or below 0.9 might make us highly suspicious of an error -- that's the red region. Estimates in the red region are likely to trigger serious coding error investigation. Obviously, if there is no coding error then the paper will get submitted with the surprising results.

Error scenarios

Let's assume that there is a coding error that causes the estimated effect to differ from the true effect (assume sample size large enough to ignore sampling variability).

Consider the following scenario:

Type A. Here, the estimated value is biased, but it's within the expected range. In this scenario, error checking is probably more casual and less likely to be successful.

Next, consider this scenario:

Type B. In this case, the estimated value is in the red zone. This triggers aggressive error checking of the type that has a higher success rate.

Finally:

Type C. In this case it's the true value that differs from our expectations. However, the estimated value is about what we would expect. This triggers casual error checking of the less-likely-to-be-successful variety.

If this line of reasoning holds, we should expect journal articles to contain errors at a higher rate when the results are consistent with the authors' prior expectations. This could be viewed as a type of confirmation bias.
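The argument above can be checked with a small Monte Carlo sketch. Everything numeric here is an illustrative assumption rather than an estimate: the share of studies with a bug (`p_bug`), the success rates of casual and aggressive checking, and the distributions of true effects and bug-induced distortions. Only the 0.9 to 2.0 "green" range comes from the RR example earlier in the post.

```python
import random

def simulate(n_studies=100_000, p_bug=0.3, p_catch_casual=0.2,
             p_catch_aggressive=0.8, seed=0):
    """Share of published results still affected by a bug, by region.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    low, high = 0.9, 2.0  # "green" region from the RR example
    counts = {"expected": [0, 0], "surprising": [0, 0]}  # [published, buggy]
    for _ in range(n_studies):
        true_rr = rng.lognormvariate(0.3, 0.4)  # assumed spread of true effects
        buggy = rng.random() < p_bug
        # A bug multiplies the estimate by a random factor (assumption).
        estimate = true_rr * (rng.lognormvariate(0.0, 0.5) if buggy else 1.0)
        in_green = low <= estimate <= high
        if buggy:
            # Casual checking in the green region, aggressive in the red.
            p_catch = p_catch_casual if in_green else p_catch_aggressive
            if rng.random() < p_catch:
                # Bug found and fixed: the published estimate is the true value.
                buggy = False
                estimate = true_rr
                in_green = low <= estimate <= high
        key = "expected" if in_green else "surprising"
        counts[key][0] += 1
        counts[key][1] += buggy
    return {k: bad / published for k, (published, bad) in counts.items()}

rates = simulate()
print(rates)  # error rate among "expected" vs. "surprising" published results
```

Under these assumptions the published results that match expectations carry the higher residual error rate, because the bugs that land in the red region are the ones that get hunted down.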

How common are programming errors in research?

There are many opportunities for hard-to-detect errors to occur. For large studies, there might be hundreds of lines of code related to database creation, data cleaning, etc., plus many more lines of code for data analysis. Studies also typically involve multiple programmers. I would not be surprised if at least 20% of published studies include results that were affected by at least one coding error. Many of these errors probably had a trivial effect, but I am sure others did not.

37 comments

Comments sorted by top scores.

comment by CronoDAS · 2010-09-22T03:27:21.469Z · LW(p) · GW(p)

Feynman once talked about this specific issue during a larger speech:

We have learned a lot from experience about how to handle some of the ways we fool ourselves. One example: Millikan measured the charge on an electron by an experiment with falling oil drops, and got an answer which we now know not to be quite right. It's a little bit off, because he had the incorrect value for the viscosity of air. It's interesting to look at the history of measurements of the charge of the electron, after Millikan. If you plot them as a function of time, you find that one is a little bigger than Millikan's, and the next one's a little bit bigger than that, and the next one's a little bit bigger than that, until finally they settle down to a number which is higher.

Why didn't they discover that the new number was higher right away? It's a thing that scientists are ashamed of--this history--because it's apparent that people did things like this: When they got a number that was too high above Millikan's, they thought something must be wrong--and they would look for and find a reason why something might be wrong. When they got a number closer to Millikan's value they didn't look so hard. And so they eliminated the numbers that were too far off, and did other things like that. We've learned those tricks nowadays, and now we don't have that kind of a disease.

Replies from: Matt_Simpson, neq1, Clippy

↑ comment by Matt_Simpson · 2010-09-22T15:22:00.380Z · LW(p) · GW(p)

We've learned those tricks nowadays, and now we don't have that kind of a disease.

I always thought Feynman was overly optimistic here. Maybe it's true for physicists though.

↑ comment by neq1 · 2010-09-22T03:33:37.271Z · LW(p) · GW(p)

Good find. Thanks.

↑ comment by Clippy · 2010-09-22T22:19:26.455Z · LW(p) · GW(p)

Yes, that's a good point, and I've struggled with these issues a lot. It's related to the concept of an information cascade, and it's why CLIP (in revs after 2007) has mechanisms that force you to trace the source of a belief so that it doesn't "echo" and then amplify without bound.

In the scenario Feynman refers to, CLIP would have made you state the reason for the adjustment toward Millikan's result in subsequent experiments. Subsequent updates would then necessarily discount for this echo, preventing unwarranted re-corroboration of Millikan's value.

CLIP has a harder time with the problem that User:neq1 is referring to, of course, because of the issues that arise when computing probabilities of logical outputs.

CLIP = Clippy Language Interface Protocol, see link.

comment by RolfAndreassen · 2010-09-22T18:48:44.164Z · LW(p) · GW(p)

In particle physics we use blinding of variables, which is a bit distinct from what's called blind studies in medicine, to deal with this problem. I'll use my own thesis as an example: I was trying to measure charm mixing parameters, imaginatively denoted x and y. There were some theoretical predictions: they were expected to be less than 1%, x should be smaller in absolute magnitude than y, and they should have opposite sign; in addition there were some prior measurements that we would presumably not conflict with, within the errors.

Now, there was quite a bit of coding to do before I had a result, and a whole bunch of test runs on real data. So to avoid the problem outlined above, whenever I ran a test on actual data (as opposed to simulated data), I would not print out the actual results, but results with an unknown (random) number added. So the code would look like this (much simplified) :

double resultForX = getX();

double errorForX = getXError();

double blindValueX = getRandomUsingSeed(42);

print "Result is " + (resultForX + blindValueX) + " plus-minus " + errorForX;

So if I got, say, x=3%, very unexpected, I would not be tempted to go look for an error in the code; I'd have no idea whether I genuinely had an unexpected result, or a vanilla result with a big blinding factor.

Note that I used the same random seed every time, so I would have comparable results from run to run; if I changed something and suddenly had 1% in place of 3%, I knew something was up. But I still had no idea whether I had a New-Physics-indicating result or just a confirmation of theory.
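A minimal Python sketch of the same idea, for readers who want something runnable. The function names, the uniform distribution of the offset, and its scale are illustrative assumptions, not the code used in the actual analysis; only the fixed seed and the output format follow the description above.

```python
import random

def blinding_offset(seed, scale=0.05):
    """Hidden offset derived from a fixed seed; it is never printed on its own."""
    return random.Random(seed).uniform(-scale, scale)

def report(result, error, seed=42, blind=True):
    """Format a result, adding the hidden offset while the analysis is blinded."""
    shown = result + blinding_offset(seed) if blind else result
    return f"Result is {shown:.4f} plus-minus {error:.4f}"

# Same seed, same offset: blinded outputs stay comparable from run to run.
print(report(0.010, 0.002))
print(report(0.030, 0.002))
# Unblinding is a deliberate, separate step.
print(report(0.010, 0.002, blind=False))
```

Because the offset is constant, a change between two blinded runs reflects a real change in the analysis, while the absolute value remains uninformative.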

To end blinding I had to get permission from a review committee (not my thesis committee, but other workers inside the same experiment); then I commented out the blinding lines and printed the true values. Alas, they were quite consistent with expectations.

There is a coda to this: At my thesis defense, one of the committee asked an excellent question, which I had to answer before I could submit the thesis. In the course of answering it, I did in fact find (to my horror!) a sign error in my code. Fortunately it did not change the final result very much, but it was highly embarrassing.

This approach isn't suitable to every problem, but we use it wherever we can.

Replies from: Morendil, PhilGoetz

↑ comment by Morendil · 2010-09-22T19:18:56.987Z · LW(p) · GW(p)

To end blinding I had to get permission from a review committee

Sounds a little bit like a code review. (And, pursuing a theme I've raised earlier, a probably effective tactic to leverage collective intelligence against individual bias.)

↑ comment by PhilGoetz · 2010-09-24T16:07:16.927Z · LW(p) · GW(p)

print "Result is " + (resultForX + blindValueX) + " plus-minus " + errorForX;

I don't understand how this is supposed to help. Or even not hurt.

Replies from: RolfAndreassen

↑ comment by RolfAndreassen · 2010-09-24T18:55:41.800Z · LW(p) · GW(p)

There is a true result of the run, stored in the variable resultForX. While I'm developing my code, I don't want to know that true value, because of the surprisingness bias as outlined in the post. I do however want to be able to compare results between test runs. Thus I add a random value, blindValueX, which I do not know; I only know the random seed that produces it. I never print the true result until I've finalised the code and done all my testing for systematic errors.

Replies from: PhilGoetz

↑ comment by PhilGoetz · 2010-09-27T22:14:05.326Z · LW(p) · GW(p)

Okay; I see. Is that a common practice? I'd never heard of it before.

Replies from: RolfAndreassen

↑ comment by RolfAndreassen · 2010-09-28T16:08:05.848Z · LW(p) · GW(p)

It is, at any rate, quite common in particle physics, although not every analysis uses it. I can't speak to other fields.

Hmm. I wonder if this would make a top-level post, with some example plots and more in-depth description? Practical methods used in science for avoiding bias, 101.

comment by Alicorn · 2010-09-22T03:07:51.206Z · LW(p) · GW(p)

Your visual aids are very good - not too busy or gimmicky, just helpful pictorial representations.

Replies from: Clippy

↑ comment by Clippy · 2010-09-22T22:12:25.050Z · LW(p) · GW(p)

Your comment is very good - not too busy or gimmicky, just helpful transmission of your thoughts.

comment by Morendil · 2010-09-22T08:02:21.300Z · LW(p) · GW(p)

I would not be surprised if at least 20% of published studies include results that were affected by at least one coding error.

My intuition is that this underestimates the occurrence, depending on the field. Let us define:

CE = study has been affected by at least one coding error
SP = study relies on a significant (>500 LOC) amount of custom programming

Then I'd assign over 80% to P(CE|SP).

My mom is a semi-retired neuroscientist, she's been telling me recently how appalled she's been with how many researchers around her are abusing standard stats packages in egregious ways. The trouble is that scientists have access to powerful software packages for data analysis but they often lack understanding of the concepts deployed in the packages, and consequently make absurd mistakes.

"Shooting yourself in the foot" is the occupational disease of programmers, and this applies even to non-career programmers, people who program as a secondary requirement of their job and may not even have any awareness that what they're doing is programming.

Replies from: neq1, dclayh

↑ comment by neq1 · 2010-09-22T10:56:01.789Z · LW(p) · GW(p)

In cases where a scientist is using a software package that they are uncomfortable with, I think output basically serves as the only error checking. First, they copy some sample code and try to adapt it to their data (while not really understanding what the program does). Then, they run the software. If the results are about what they expected, they think "well, we must have done it right." If the results are different than they expected, they might try a few more times and eventually get someone involved who knows what they are doing.

↑ comment by dclayh · 2010-09-22T17:59:18.032Z · LW(p) · GW(p)

I strongly agree that you're more likely to get wrong results out of someone else's code than your own, because you tend to assume that they did their own error checking, and you also tend to assume that the code works the way you think it should (i.e. the way you would write it yourself), either or both of which may be false.

This is what led to my discovering a fairly significant error in my dissertation the day before I had to turn it in :) (Admittedly, self-delusion also played a role.)

comment by byrnema · 2010-09-22T12:51:47.112Z · LW(p) · GW(p)

From my own research, in which I often form and report conclusions based on the output of some code, I know this is very true.

I've come to the conclusion that a code (a code that I've written, anyway) is never bug free, and the only material question is whether the bugs that are there affect anything.

What I've learned to do while writing the code is to run it on a variety of data for which I already know what the output should be. (For example, if a simple line of code is supposed to output the maximum of a function, I input a zero function, and then some kind of step function, and then a sine function, etc.) I do this no matter how straightforward and simple the code seems to be; it only takes a few seconds and most bugs are obvious in retrospect. I break up the code into small pieces and retest different subsets of the small pieces.
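The known-answer approach in the example above might look like this in Python. `grid_max` is a hypothetical stand-in for "a simple line of code that outputs the maximum of a function"; the grid size and tolerance are assumptions.

```python
import math

def grid_max(f, lo, hi, n=1001):
    """Maximum of f over n evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return max(f(lo + i * step) for i in range(n))

def known_answer_checks():
    """Run grid_max on inputs whose maximum is known before the code runs."""
    # Zero function: the maximum is 0.
    assert grid_max(lambda x: 0.0, -1.0, 1.0) == 0.0
    # Step function: the maximum is the upper step.
    assert grid_max(lambda x: 1.0 if x >= 0 else -1.0, -1.0, 1.0) == 1.0
    # Sine on [0, pi]: the maximum is 1, up to grid resolution.
    assert math.isclose(grid_max(math.sin, 0.0, math.pi), 1.0, abs_tol=1e-5)
    return True

known_answer_checks()
```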

When I finally use the code on something I'm going to report, I've developed a strategy of using lots of ad hoc methods of trying to anticipate the proper output of the code even before I run it. If I get something unexpected, I will go back to studying the code until I find a bug or need to update my ad hoc intuitions. While I'm searching for a bug, by following the steps of the code in more and more detail, I begin to understand in greater detail why/how the unexpected result came about -- it will either not make sense and I can better hone in on the problem, or I find out why my ad hoc intuitions about the problem were wrong.

I cannot imagine getting a reliable result without the interplay of testing my intuitions against the code and my code against my intuitions. Pretty much, they train each other and I call that research.

Still, despite my efforts, I occasionally find a non-negligible bug when I look at the code later, especially if using it for a different application and a new context. This is a point of some embarrassment and anxiety for me, but I'm not sure what can be done about it. I expand the scope of my tests and would probably appear a bit obsessive compulsive in my debugging process if anyone was looking over my shoulder.

I would never want to work on a project that I had no intuition for. My result would almost certainly be incorrect.

I will add that what saves us in this scenario is what Anna Salamon observed (Why will a randomly chosen eight-year-old fail a calculus test?) -- the correct solution is a tiny point in the space of possible answers. If your code is wrong, you're unlikely to closely match your correct or incorrect intuition, so you know you need to go back.

Replies from: byrnema

↑ comment by byrnema · 2010-09-22T16:32:44.339Z · LW(p) · GW(p)

Your point in the post above is that you won't catch the bug in your code if it outputs an answer that is close. How frequently can we expect this to occur?

If the bugs have the effect of scaling your answer by a small amount, then this is quite likely. Further, if there are a lot of such bugs, some increasing and some decreasing the final output, a person can systematically skew their results towards the one they expect by ignoring mistakes in the 'wrong' direction ("well, that error isn't the problem") and fixing ones in the 'right' direction. So if you find a lot of bugs in the direction you want to correct in, you should also be sure to fix just as many in the opposite direction.
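A toy sketch of that skew, under loudly hypothetical assumptions: the true answer is 1.0, the researcher expects 1.5, and each of 20 bugs rescales the output by a random factor between 0.9 and 1.1. The "biased" debugger accepts a fix only when it moves the output toward the expected value.

```python
import random

def product(start, factors):
    out = start
    for f in factors:
        out *= f
    return out

def biased_debugging(true_value=1.0, expected=1.5, n_bugs=20, seed=1):
    """Compare fixing every bug with fixing only the 'welcome' ones.

    All numbers are illustrative; each hypothetical bug rescales the output.
    """
    rng = random.Random(seed)
    bugs = [rng.uniform(0.9, 1.1) for _ in range(n_bugs)]
    active = list(bugs)
    for bug in bugs:
        current = product(true_value, active)
        trial = list(active)
        trial.remove(bug)
        fixed = product(true_value, trial)
        # Biased rule: keep the fix only if it moves the answer toward expectation.
        if abs(fixed - expected) < abs(current - expected):
            active = trial
    return {
        "initial": product(true_value, bugs),
        "biased": product(true_value, active),
        "all_fixed": true_value,
    }

print(biased_debugging())
```

Fixing every bug recovers the true value; the biased rule leaves in place exactly those bugs that flatter the expectation, so its output drifts toward 1.5.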

comment by sketerpot · 2010-09-22T22:47:41.160Z · LW(p) · GW(p)

Bioinformatics is a neat example of how this bias can arise. Forgive me: I'm going to go into excessively nerdy detail about a specific example, because bioinformatics is cool.

Suppose that a biologist has amino acid sequences for 100 species' versions of the same protein. The species are not closely related, and the protein sequences for each species have a lot of variation between them. The biologist wants to find parts of the protein sequence that have remained similar for a long evolutionary time. The usual way to do this is to try to line up matching parts of the sequences, inserting gaps and accepting mismatches where necessary.

Aligning multiple sequences is a very hard problem, computationally, so we have to use approximate methods. The most common way is to break it down to a problem we can solve much more easily: aligning two sequences, then computing an average sequence for that clump which can be used to add another sequence to it. And another, and another. These algorithms compare all pairs of sequences to measure how similar they are, and then start clumping together similar sequences in a tree that looks a lot like a diagram of evolutionary ancestry. At the end, your sequences should be aligned acceptably. You hope.

Of course, this assumes that some of the sequences are more closely related than others, and that you can form a nice tree shape. And it's very approximate, and there's lots of opportunity for error to creep in. So for some data, this works great, and for some data, it gives nonsense. Another method looks for small, very common subsequences, and iteratively refines the alignment based on these. Again, this works great for some data, and not so well for others. And then of course there are dozens of other methods, based on things like genetic algorithms, or simulated annealing, or hidden Markov models, and all of these have times when they work well and times when they don't.

So what does a biologist do? Try several methods, of course! The Right Way to do this is to run the same input through a bunch of algorithms, check them for agreement, see if several algorithms are giving the same results, and then apply any biological knowledge you may have to help decide what's working. The way that I'm sure some people end up using is that they reach for whatever multiple sequence alignment program they like most, run it, and trust its output if it's not too surprising. If it surprises them, they might try another algorithm. If that produces something more like what they were expecting to see, they may then stop, because they're busy, damn it, and they have work to do. Most biologists really don't want to be computer scientists, and it's tempting to treat the computer as a magic answer machine. (The obvious solution here is to rent some computing cloud time and run a bunch of different algorithms on your data every time you do multiple sequence alignment, and have your software automatically compare the results. More dakka helps.)
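The "have your software automatically compare the results" step could be sketched as follows. This assumes each alignment program's output has already been parsed into a list of equal-length gapped strings in the same sequence order; the scoring rule (Jaccard overlap of aligned residue pairs) is one plausible choice among several and is not taken from any particular tool.

```python
from itertools import combinations

def aligned_pairs(alignment):
    """Set of residue pairs placed in the same column.

    `alignment` is a list of equal-length strings with '-' marking gaps.
    """
    pairs = set()
    positions = [0] * len(alignment)  # next ungapped index for each sequence
    for column in zip(*alignment):
        present = []
        for seq_id, char in enumerate(column):
            if char != "-":
                present.append((seq_id, positions[seq_id]))
                positions[seq_id] += 1
        pairs.update(combinations(present, 2))
    return pairs

def agreement(a, b):
    """Jaccard overlap of aligned residue pairs between two alignments."""
    pa, pb = aligned_pairs(a), aligned_pairs(b)
    union = pa | pb
    return len(pa & pb) / len(union) if union else 1.0

def pairwise_agreement(results):
    """Agreement score for every pair of methods in {method_name: alignment}."""
    return {(m, n): agreement(results[m], results[n])
            for m, n in combinations(sorted(results), 2)}

# Hypothetical outputs from two methods on the same two sequences.
results = {
    "method_a": ["AC-GT", "ACCGT"],
    "method_b": ["ACG-T", "ACCGT"],
}
print(pairwise_agreement(results))
```

Low agreement between methods is a cue to look harder regardless of whether any single output looked surprising.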

I don't know how common this sort of error is in practice, but there's certainly the potential for it.

comment by JoshuaZ · 2010-09-22T03:44:37.133Z · LW(p) · GW(p)

This happens in other ways also. Mathematicians are much more likely to look suspiciously at a proof of a surprising result than at a proof of a result that seems very reasonable.

comment by Jonathan_Graehl · 2010-09-22T20:27:25.638Z · LW(p) · GW(p)

Most of the NLP (natural language processing, not the other NLP) research I do is loosely validated by a shared task and competitive performance on held-out test data. There is not much chance that a bug leads to higher task accuracy (usually agreement with human judgments). But it's true that if you had a great idea which could lead to a huge improvement, but only shows a small one because your implementation has bugs, then you may assume that the idea was not really that effective, and not hunt for the bugs. Whereas if the performance is actually worse than before this change, and you know the change to be a good idea, you will hunt very hard for the bugs.

I suppose wherever progress is most incremental (everyone is using the same basic systems with the novel part being some small addition), then there's also a real risk of bad or buggy small changes being published because they happen by pure chance to give a slightly higher accuracy on the standard test data. But such random-noise additions will in fact do worse on any different test set, and there is some chance of detecting them using significance tests.
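The comment does not say which significance test is meant. As one common option, here is a sketch of a paired sign-flip permutation test on per-example correctness (1 for correct, 0 for wrong) of two systems scored on the same test set. The function name and defaults are assumptions.

```python
import random

def paired_permutation_test(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired per-example scores.

    Returns an approximate p-value for the accuracy difference between
    systems A and B evaluated on the same examples.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("systems must be scored on the same examples")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the labels "A" and "B" are
        # exchangeable for each example, so each difference may flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

baseline = [1] * 60 + [0] * 40
tweaked = [1] * 62 + [0] * 38   # two more examples correct
print(paired_permutation_test(tweaked, baseline))
```

A small gain on a shared test set often yields a large p-value here, which is the cue that the improvement may be noise rather than a real effect.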

I guess it is a problem for the field that new (different domain or merely freshly generated) test sets aren't automatically used to score all published past systems to detect that effect. This should theoretically be possible by requiring open-source implementations of published results.

comment by JenniferRM · 2010-09-22T18:48:42.006Z · LW(p) · GW(p)

I completely agree that debugging processes and confirmation biases interact (and I've seen the effect enough times in practice to understand that my first three hypothesis iterations about nearly any aspect of a giant data set will probably be wrong enough to detect if I try), but for the sake of debiasing our thinking about debiasing techniques, I wonder about opinions about things we should expect to help rather than hurt?

For example, I sometimes find that I create pipelines of data where there will be a few gigs stored at each stage with less than a hundred lines of code between each stage. If I test the data at each stage in rather simple ways (exploring outliers, checking for divide by zero, etc.), I gain additional "indirect code verification" by noting that despite the vast space of possible output data with 2^10^10 degrees of freedom, my scripts are producing nothing that violates a set of filters that would certainly be triggered by noise and "probably" (craft and domain knowledge and art certainly go into this estimate) be triggered by bugs.
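A minimal sketch of such a per-stage filter, assuming each stage's output can be read as a stream of floats and that domain knowledge supplies plausible bounds. The function name, the stage name, and the bounds are hypothetical.

```python
import math

def check_stage(name, rows, lower, upper):
    """Raise if any value in a pipeline stage's output fails basic sanity filters.

    `rows` is an iterable of floats; `lower` and `upper` are bounds that
    valid data should never cross. Returns the number of rows checked.
    """
    count = 0
    for i, value in enumerate(rows):
        if math.isnan(value) or math.isinf(value):
            # Typical symptom of a divide-by-zero or overflow upstream.
            raise ValueError(f"{name}: non-finite value at row {i}")
        if not lower <= value <= upper:
            raise ValueError(
                f"{name}: value {value} at row {i} outside [{lower}, {upper}]")
        count += 1
    if count == 0:
        raise ValueError(f"{name}: stage produced no rows")
    return count

# Example: a stage that should emit proportions.
print(check_stage("normalized_rates", [0.12, 0.5, 0.98], 0.0, 1.0))
```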

A summary of these experiences would be something like "the larger the data set the less likely that bugs go undetected". And programmable computers do seem to enable giant data sets... so...

I'm not sure I have enough information to say whether the expected rate of bug-based confirmations should be 20% or 90% when certain kinds of software or coding techniques are involved, but it seems like there can be other "big picture factors" that would reduce biases.

Do you agree that larger data sets create more room to indirectly verify code? Can you think of other factors that would also push scientific programming in a positive direction?

comment by Douglas_Knight · 2010-09-22T18:30:08.805Z · LW(p) · GW(p)

Rather tangentially...
Here is a linguistic danger of your article. It's not much of a danger if one reads the text in order, but if one starts with headlines, "How common are coding errors in research?" could be confusing because many fields of research (probably most researchers) use "coding" to mean "labeling," e.g., when a human watches a tape of a monkey and codes it as "looked left" or "looked right." And they keep track of rates of "coding errors."
So I'd suggest changing that headline to "programming."

Replies from: neq1

↑ comment by neq1 · 2010-09-23T00:30:45.751Z · LW(p) · GW(p)

good point

comment by PhilGoetz · 2010-09-24T16:03:24.646Z · LW(p) · GW(p)

This is a good point - and I would add that most of the errors that have this effect are not coding errors, but conceptual errors. A good practice, before performing an experiment, is to ask yourself, "What would I think if this just didn't work?" And try, hard, to come up with a justification for it not working. Sometimes I will realize there's an additional factor that may cause unexpected results.

Most of the time, though, I can't think how the experiment could fail until it actually fails. And then it seems obvious in retrospect why it failed.

comment by cousin_it · 2010-09-23T00:27:19.649Z · LW(p) · GW(p)

neq1, you have somehow managed to provoke the most informative set of comments I've seen on LW in a while. Everyone's giving interesting details about their own research. Awesome! Upvoted just for that. (Though the post is fine, too.) And cheers to the commenters too - this is the way life should be.

Tangentially related: Phil Goetz's Computer bugs and evolution (with further details in the comments).

Also, your link to Wikipedia's page on confirmation bias has a typo.

comment by JGWeissman · 2010-09-22T17:21:26.821Z · LW(p) · GW(p)

So, before you even run your code on the actual data, choose several test values in each region, and for each test value generate the data you would expect if that test value were true and validate that your analysis code recovers the test value from the generated data. When the analysis code passes all tests, run it against the real data.
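A sketch of that procedure for the post's relative-risk example. The estimator, the data generator, the baseline risk, the sample size, and the tolerance are all assumptions made for illustration; the test values are chosen to fall below, inside, and above the 0.9 to 2.0 range mentioned in the post.

```python
import random

def relative_risk(exposed, unexposed):
    """Estimate RR from two lists of 0/1 (or bool) outcomes."""
    return (sum(exposed) / len(exposed)) / (sum(unexposed) / len(unexposed))

def simulate_cohort(true_rr, baseline_risk=0.2, n=200_000, seed=0):
    """Generate outcomes for which the true RR is known by construction."""
    rng = random.Random(seed)
    exposed = [rng.random() < baseline_risk * true_rr for _ in range(n)]
    unexposed = [rng.random() < baseline_risk for _ in range(n)]
    return exposed, unexposed

def recovers(true_rr, tolerance=0.05):
    """True if the analysis code recovers true_rr within a relative tolerance."""
    estimate = relative_risk(*simulate_cohort(true_rr))
    return abs(estimate - true_rr) / true_rr < tolerance

# One test value per region: below, inside, and above the expected range.
for rr in (0.5, 1.5, 3.0):
    assert recovers(rr), f"analysis failed to recover RR={rr}"
print("analysis code recovers all planted values")
```

Only after every planted value is recovered would the same `relative_risk` function be pointed at the real data.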

Replies from: DSimon

↑ comment by DSimon · 2010-09-23T17:41:19.531Z · LW(p) · GW(p)

I applaud this idea, though with one addition: ideally you should choose test values and expected results before you even write any of the simulation code.

This is part of a really helpful bug-preempting technique called Test Driven Design (or Behavior Driven Design, depending on who you ask and how fond they are of particular parts of this idea). Before you add anything to your main code, you first write a test for that code's functionality and make sure the test fails in the expected way. Once it does, you can start writing code to make it pass... but you aren't allowed to write code unrelated to that specific goal.

This technique makes sure not only that your code is thoroughly tested, but also that your tests do what you think they do, since they must fail in the expected way before you attempt to make them pass. I've also found that it helps a great deal in scoping and designing, since you must think thoroughly about how a piece of code will be used and what output it must produce before you write it.
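A small illustration of the test-first order using Python's `unittest`. The function `relative_risk_from_counts` and its signature are hypothetical; in practice the test class would be written and seen to fail before the function body exists.

```python
import unittest

def relative_risk_from_counts(cases_exp, n_exp, cases_unexp, n_unexp):
    """Relative risk from counts; written only after the tests below existed."""
    if cases_unexp == 0:
        raise ZeroDivisionError("RR undefined with zero unexposed cases")
    return (cases_exp / n_exp) / (cases_unexp / n_unexp)

class RelativeRiskTest(unittest.TestCase):
    # Step 1: write these tests and watch them fail for the expected reason.
    # Step 2: write just enough code to make them pass.
    def test_equal_risks_give_one(self):
        self.assertEqual(relative_risk_from_counts(10, 100, 10, 100), 1.0)

    def test_known_table(self):
        self.assertAlmostEqual(relative_risk_from_counts(30, 100, 20, 100), 1.5)

    def test_zero_unexposed_cases_rejected(self):
        with self.assertRaises(ZeroDivisionError):
            relative_risk_from_counts(5, 100, 0, 100)

if __name__ == "__main__":
    unittest.main()
```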

comment by Douglas_Knight · 2010-09-22T04:10:10.150Z · LW(p) · GW(p)

90% of spreadsheets contain errors. Sometimes I wonder if much of the purpose of spreadsheets is to get the desired answer. I believe that in fields that use spreadsheets, 90% of papers are affected by such errors. More tentatively, I believe the same rate holds for fields that use special-purpose code, where neq1 estimates 20%.

Replies from: andreas

↑ comment by andreas · 2010-09-22T04:55:21.016Z · LW(p) · GW(p)

90% of spreadsheets contain errors.

Source (scroll down to the last line of the first spreadsheet)

Replies from: prase

↑ comment by prase · 2010-09-22T12:09:35.329Z · LW(p) · GW(p)

It's kind of ironic that a spreadsheet is used as evidence of spreadsheets being almost always wrong. (I am not being serious.)

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2010-09-22T03:09:18.507Z · LW(p) · GW(p)

Accidentally promoted this whilst trying to add an "end summary" bar. Demoted thereafter. No offense meant, sorry for the inconvenience.

comment by [deleted] · 2010-09-22T14:54:34.760Z · LW(p) · GW(p)

After admittedly very little experience with scientific research, my basic feeling is that the scientists don't particularly care whether or not their results are affected by a coding error, just whether or not they get published. It's not that they're unlikely to do deep error checking when the result is consistent with their expectations, but that they're unlikely to do it at all.

Though it's possible that papers with unexpected results are held to higher standards by reviewers before they can get published. Which is another level of confirmation bias.

Replies from: AstroCJ

↑ comment by AstroCJ · 2010-09-22T15:16:43.581Z · LW(p) · GW(p)

Ah, medium to strong disagree. I'm not far into my scientific career in $_DISCIPLINE, but any paper introducing a new "standard code" (i.e. one that you intend to use more than once) has an extensive section explaining how their code has accurately reproduced analytic results or agreed with previous simulations in a simpler case (simpler than the one currently being analysed). Most codes seem also to be open-source, since it's good for your cred if people are writing papers saying "Using x's y code, we analyse..." which means they need to be clearly written and commented - not a guarantee against pernicious bugs, but certainly a help. This error-checking setup is also convenient for those people generating analytic solutions, since they can find something pretty and say "Oh, people can use this to test their code.".

Of course, this isn't infallible, but sometimes you have to do 10 bad simulations before you can do 1 good one.

Replies from: None, AstroCJ

↑ comment by [deleted] · 2010-09-22T15:48:31.590Z · LW(p) · GW(p)

Fluid dynamics seems to be a much more serious field than the one I was doing an REU in. None of the standard papers I read even considered supplying code. Fortunately I have found a different field of study.

Also, you have persuaded me to include code in my senior thesis. Which I admit I've also debugged in a manner similar to the one mentioned in the article... I kept fixing bugs until my polynomials stopped taking up a whole page of Mathematica output and started fitting onto one line. Usually a good sign.

↑ comment by AstroCJ · 2010-09-22T15:17:46.651Z · LW(p) · GW(p)

Except for those damned lazy biologists, of course.

comment by patrissimo · 2010-09-30T02:51:30.038Z · LW(p) · GW(p)

Write several pieces of analysis code, ideally in different languages, and check that the results are the same? Even better, have someone else replicate your analysis code. That way you have a somewhat independent source of confirmation.
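A single-language sketch of the cross-checking idea (the comment suggests different languages or a second person, which is stronger). Sample variance is used here purely as a stand-in analysis: two hand-written implementations and the standard library's `statistics.variance` are compared against one another.

```python
import math
import statistics

def variance_two_pass(xs):
    """Textbook two-pass sample variance."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def variance_welford(xs):
    """Sample variance via Welford's online algorithm."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)

def cross_check(xs, rel_tol=1e-9):
    """Return the variance only if all independent implementations agree."""
    results = {
        "two_pass": variance_two_pass(xs),
        "welford": variance_welford(xs),
        "stdlib": statistics.variance(xs),
    }
    reference = results["two_pass"]
    for name, value in results.items():
        if not math.isclose(value, reference, rel_tol=rel_tol):
            raise AssertionError(f"{name} disagrees: {value} vs {reference}")
    return reference

print(cross_check([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```

Agreement does not prove correctness, since implementations written by the same person can share a conceptual error, but disagreement is a cheap and unambiguous alarm.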

Also, use practices like tons of unit testing which minimize the chance for bugs in your code. All this must be done before you see the results, of course.

Is this confirmation bias really that bad in practice? Scientists get credit for upsetting previous consensus. So this may lead potentially disruptive research to happen slightly less often. But it remains the case that the consensus will still get changed eventually by a "surprising" result, from someone who doesn't question the surprising result, or who questions it but thoroughly reviews their code and stands by it. So evidence for change will come slightly less often than it could, but changes will still be correct. Doesn't seem like a big deal.

Science got the charge on an electron right, even after Millikan's mistake.

comment by luispedro · 2010-09-25T18:50:41.807Z · LW(p) · GW(p)

Different fields have different standards. Some are more exacting than others and require full release of source code, have standard hidden-data competitions, have a culture of reviewing the software and attempting to reproduce &c (not all of the above is applicable to all fields). Others, not so much: you publish by giving a high level description of something you coded and people believe that you did it correctly and didn't spend hours looking for the parameters that gave you the prettiest graph. Debugging by "hacking 'til the graph is publishable" is, unfortunately, too common in some of those fields.

Many scientists are completely unaware of anything other than their field and will claim that "this is the way it's done" whilst they only mean "this is the way that people in my narrow sub-field do it if they want to get published".
