This is why we can't have social science

post by Costanza · 2014-07-13T21:04:41.700Z · LW · GW · Legacy · 83 comments

Jason Mitchell is [edit: has been] the John L. Loeb Associate Professor of the Social Sciences at Harvard. He has won the National Academy of Sciences' Troland Award as well as the Association for Psychological Science's Janet Taylor Spence Award for Transformative Early Career Contributions.

Here, he argues against the principle of replicability of experiments in science. Apparently, it's disrespectful, and presumptively wrong.

Recent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way. Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them.

Three standard rejoinders to this critique are considered and rejected. Despite claims to the contrary, failed replications do not provide meaningful information if they closely follow original methodology; they do not necessarily identify effects that may be too small or flimsy to be worth studying; and they cannot contribute to a cumulative understanding of scientific phenomena.

Replication efforts appear to reflect strong prior expectations that published findings are not reliable, and as such, do not constitute scientific output.

The field of social psychology can be improved, but not by the publication of negative findings. Experimenters should be encouraged to restrict their “degrees of freedom,” for example, by specifying designs in advance.

Whether they mean to or not, authors and editors of failed replications are publicly impugning the scientific integrity of their colleagues. Targets of failed replications are justifiably upset, particularly given the inadequate basis for replicators’ extraordinary claims.

This is why we can't have social science. Not because the subject is not amenable to the scientific method -- it obviously is. People are conducting controlled experiments and other people are attempting to replicate the results. So far, so good. Rather, the problem is that at least one celebrated authority in the field hates that, and would prefer much, much more deference to authority.

83 comments

Comments sorted by top scores.

comment by Azathoth123 · 2014-07-13T21:55:48.425Z · LW(p) · GW(p)

The amusing thing is that Mitchell's argument proves much more than he wants it to prove.

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way. Unless direct replications are conducted by flawless experimenters, nothing interesting can be learned from them.

Notice that the above argument applies just as well to the original experiment being replicated.

Replies from: buybuydandavis, Richard_Kennaway, philh, Eliezer_Yudkowsky, None
comment by buybuydandavis · 2014-07-14T02:26:24.569Z · LW(p) · GW(p)

Yes, noticed.

Has anyone read his entire article? Does he attempt any justification for why this particular argument doesn't equally apply to the original experiment?

One principle I try to keep in mind is "The other guy is probably not a total moron. If it seems that way, you're probably missing something."

Replies from: SilentCal
comment by SilentCal · 2014-07-14T21:05:49.718Z · LW(p) · GW(p)

I read it. He has a section titled "The asymmetry between positive and negative evidence".

His argument is that a positive result is like seeing a black swan, and a null result is like seeing a white swan, and once you see a black swan, then no matter how many white swans you see it doesn't prove that all swans are white.

He addresses the objection that this leaves us unable to ever reject a spurious claim. His answer is that, since negative evidence is always meaningless, the way to reject a claim is to find positive evidence that the original experimenter did something wrong.

I think this is a fair summary of the section. It's not long, so you can check for yourself. I am... not impressed.

Replies from: buybuydandavis
comment by buybuydandavis · 2014-07-14T23:24:04.745Z · LW(p) · GW(p)

His argument is that a positive result is like seeing a black swan

Actually, it's like hearing a report of a black swan, which is why the burden of proof is generally put on the report.

It's even worse than that for him. What a bad analogy for him to rest his case on. Surely, the purpose of these social science studies is not to claim the existence of some bizarre subset of the population (a black swan), but to claim that the results will generalize to the population at large (all swans are black).

It's not long, so you can check for yourself.

That's more than enough for me.

Thanks for taking the bullet for us.

Replies from: SilentCal
comment by SilentCal · 2014-07-15T20:45:09.831Z · LW(p) · GW(p)

There's a lot wrong with the argument; he has no actual justification for assuming that social science is anything like swan-spotting.

But even within his unjustified analogy... apparently if someone reports a new color of swan in Australia, he might give polygraphs and vision tests to the reporter, but sending an expedition to Australia to check it out would be of no scientific value.

comment by Richard_Kennaway · 2014-07-14T08:59:52.159Z · LW(p) · GW(p)

Notice that the above argument applies just as well to the original experiment being replicated.

Medicine has had its Ioannidis. Is anyone doing the same for psychology and social science?

Replies from: satt, ChristianKl
comment by philh · 2014-07-14T13:22:19.972Z · LW(p) · GW(p)

My interpretation is that he's saying false positives are unlikely but false negatives are common. If that's the case, then getting a positive result should greatly affect your beliefs; but a negative replication shouldn't shift them much.

(I can't actually justify this interpretation by reading the text, but it does make the text seem fairly reasonable.)
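
To make that reading concrete, here is a minimal Bayesian sketch. All of the numbers - the prior, the false positive rate, and the power - are made up purely for illustration:

```python
# Toy numbers for the asymmetric-error-rates reading: false positives rare
# (alpha = 0.05), false negatives common (power = 0.5).

def update(prior, power, alpha, positive):
    """P(the effect is real) after one experiment, via Bayes' rule."""
    if positive:
        num = power * prior
        den = num + alpha * (1 - prior)
    else:
        num = (1 - power) * prior
        den = num + (1 - alpha) * (1 - prior)
    return num / den

p = update(0.10, power=0.5, alpha=0.05, positive=True)
print(f"after a positive original result: {p:.2f}")  # ~0.53, up from 0.10
p = update(p, power=0.5, alpha=0.05, positive=False)
print(f"after one failed replication:     {p:.2f}")  # ~0.37, only a modest drop
```

Under these assumptions a positive result multiplies the odds by ten (likelihood ratio 0.5/0.05), while a failed replication only roughly halves them (0.5/0.95) - which is the shape of the claim, even if Mitchell's actual numbers would still have to be argued for.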

comment by [deleted] · 2014-07-13T23:24:15.928Z · LW(p) · GW(p)

Congratulations, you're smarter than a Harvard professor. I noticed the same thing, so I'm also smarter than a Harvard professor. I would hope that everyone on this site is smarter than a Harvard professor. A Harvard professor who got his B.A. and M.S. from Yale and his Ph.D. from Harvard.

This is sociologically interesting: is it an isolated incident that a Harvard professor -- an award-winning one, no less -- would so loudly fail to comprehend how science and statistics work? That doesn't seem likely, for two reasons. First, he probably talked it over with other people in the field, or at least mentioned why he thought replication was bad and wrong, and no one talked him out of it. Second, the system he came through produced him, and it's unlikely that that sort of error would only be produced once, since education is fairly standardized. So: what does this say about the relevant institutions? (The social sciences, HYP, academia, etc.)

Replies from: Aleksander
comment by Aleksander · 2014-07-14T02:14:52.336Z · LW(p) · GW(p)

I don't think that means you are smarter than that Harvard professor. He is a very successful person and has reached heights coveted by many very smart people. It just means that the game he is playing is not one where you get ahead by saying things that make sense.

For example, if you listen to a successful politician and spot a false statement he utters, that does not mean that you are smarter than that politician.

Replies from: James_Miller
comment by James_Miller · 2014-07-14T02:31:08.761Z · LW(p) · GW(p)

Yes, academics are supposed to raise the status of their institution. This brings in money, which helps educate students, which makes the world a better place. Unsuccessful replication threatens this. Plus, replication does nothing to advance social justice.

Replies from: ChristianKl, wedrifid
comment by ChristianKl · 2014-07-14T13:29:00.036Z · LW(p) · GW(p)

His website describes his research as:

Jason employs functional neuroimaging (fMRI) and behavioral methods to study how we infer the thoughts, feelings, and opinions of others (i.e., how we mentalize) as well as how we reason about counterfactual experiences.

That doesn't seem to me like someone who focuses on advancing social justice.

Replies from: gwern, James_Miller
comment by gwern · 2014-07-14T20:29:51.011Z · LW(p) · GW(p)

You should have looked at his vita for a more accurate description of his activities. If you had looked at his paper titles, some of them indicate he's no stranger to social justice-style theorizing and investigation, and likewise his funding sources, on top of Harvard's well-earned reputation: e.g. "What’s in a forename?: Cue familiarity and stereotypical thinking", "Gender differences in implicit weight identity", "Deflecting negative self-relevant stereotype activation: The effects of individuation", "Me and my group: Cultural status can disrupt cognitive consistency", and the funding:

June 2007 – May 2010: National Science Foundation (BCS 0642448), "The neural basis of stereotyping", $609,800 (co-PI: Mahzarin Banaji)...September 2010 – August 2012: Templeton Foundation for Positive Neuroscience, "Vicarious Neural Response to Others as a Basis for Altruistic Behavior", $180,000 (co-PI: Jamil Zaki)

Replies from: ChristianKl
comment by ChristianKl · 2014-07-14T21:05:43.211Z · LW(p) · GW(p)

Sorry, my mistake, as far as the paper titles go.

As far as the funding goes, the National Science Foundation isn't an entity that I would see as spearheading the social justice movement.

The same goes for the Templeton Foundation. They have the reputation of wanting "progress in spiritual discoveries" instead of "advancing social justice".

Replies from: gwern
comment by gwern · 2014-07-14T21:40:23.942Z · LW(p) · GW(p)

Both groups are quite big and may fund more than you think; not that the grantors always get what they think they're getting or are the only people who are then allowed to draw upon the research. For example, consider "stereotype threat", much beloved of social-justice types for explaining how bad white people keep test scores low for women and blacks; you can see the NSF certainly has been involved in that research in the past just with a cursory google: http://scholar.google.com/scholar?q=%22stereotype%20threat%22%20%22National%20Science%20Foundation%22 (I count ~4 NSF grants attested to just from the snippet-view for the first page).

Replies from: ChristianKl
comment by ChristianKl · 2014-07-14T22:11:42.017Z · LW(p) · GW(p)

Both groups are quite big and may fund more than you think; not that the grantors always get what they think they're getting or are the only people who are then allowed to draw upon the research.

I do accept that both groups do fund a large variety of causes, but I still wouldn't conclude from the funding source that there's a bias in the direction of the social justice movement.

When thinking about the title "The neural basis of stereotyping" you might be right that it smells like pseudoscience. It's a bit like "The neural basis of acupuncture". Instead of searching for the neural basis it would make much more sense to focus on studying the actual effect.

I wonder if I picked up too many stereotypes about fMRI research ;)

Replies from: gwern
comment by gwern · 2014-07-15T00:01:49.134Z · LW(p) · GW(p)

I do accept that both groups do fund a large variety of causes, but I still wouldn't conclude from the funding source that there's a bias in the direction of the social justice movement.

I never said the NSF is biased in favor of SJ. I said if you had bothered to look at the vita instead of stopping at the most convenient place, you would have found a number of paper and grant titles which indicate a more than theoretical interest in topics strongly associated with SJ, on top of his affiliation with an institution with a strong background, both current and historical, in liberal thought and SJ-like figures such as Cornel West, and which, to borrow from parallel ongoing conversations, crucified Summers for his well-founded suggestion that the elite math achievement gap might be a necessary consequence of gender differences. And then you said the NSF was evidence against SJ association, which is either wrong or weak evidence since they fund related research all the time.

Replies from: ChristianKl
comment by ChristianKl · 2014-07-15T09:35:12.825Z · LW(p) · GW(p)

They might have crucified Summers, but they let him rise to be president of Harvard in the first place, and didn't seem to have successfully taught him not to say things like that.

Cornel West was a professor of religion and then a professor of African-American Studies. As far as my instincts go, I wouldn't expect the same thing from Harvard's department of psychology as I would from the department of African-American Studies.

And then you said the NSF was evidence against SJ association, which is either wrong or weak evidence since they fund related research all the time.

I didn't. I took your post as saying that I should update in the direction of him being part of the social justice movement based on his funding source and wanted to reject updating based on that information in that direction.

comment by James_Miller · 2014-07-14T14:50:07.184Z · LW(p) · GW(p)

He teaches at an elite university so it's highly probable he greatly values social justice.

Replies from: hairyfigment
comment by hairyfigment · 2014-07-14T19:28:55.457Z · LW(p) · GW(p)

Either you don't read your own comments, or you classify all social justice as signaling.

In the latter case, I'm sure you'll come up with reasons why you aren't making a fully general counterargument. I would urge you to stop for a second and ask concretely what you would have thought about, say, abolitionists.

Replies from: James_Miller
comment by James_Miller · 2014-07-14T21:11:16.030Z · LW(p) · GW(p)

I don't understand what you mean. Social justice is a label for a set of beliefs. I do not think that social justice is just about signaling. If it helps to understand what I mean, I'm a politically incorrect college professor.

Replies from: hairyfigment
comment by hairyfigment · 2014-07-14T21:30:27.937Z · LW(p) · GW(p)

academics are supposed to raise the status of their institution...He teaches at an elite university so it's highly probable he greatly values social justice.

Replies from: James_Miller
comment by James_Miller · 2014-07-14T22:01:51.429Z · LW(p) · GW(p)

Are you implying that there is a contradiction in what you quoted? If so, I don't see it.

Replies from: hairyfigment
comment by hairyfigment · 2014-07-15T01:57:08.378Z · LW(p) · GW(p)

There may indeed be a contradiction, but the real question - if we grant the first part - is what do you know and how do you know it? This seems to me like a really obvious point.

comment by wedrifid · 2014-07-20T01:04:14.957Z · LW(p) · GW(p)

Plus, replication does nothing to advance social justice.

Reasoning for this claim? Do you also believe that original research can do nothing to advance social justice? That doesn't seem likely.

comment by CronoDAS · 2014-07-15T03:12:39.307Z · LW(p) · GW(p)

And now you find a man saying that it is an irrelevant demand to expect a repeatable experiment. This is science?

-- Richard Feynman, "Cargo Cult Science"

(Yes, I am aware of the irony of appealing to authority to mock someone who says we need to defer more to established authorities.)

Replies from: None
comment by [deleted] · 2014-07-15T20:22:17.616Z · LW(p) · GW(p)

What irony? People want to have authorities that can be appealed to, so if you value epistemic rationality, just saying it's bad to appeal to authority won't work as well as capturing the drive: providing authorities who can be appealed to in order to support memes beneficial to epistemic rationality, including that it's bad to appeal to authority. If you leave the drive uncaptured, someone else can capture it.

Replies from: DaFranker
comment by DaFranker · 2014-07-28T14:24:07.755Z · LW(p) · GW(p)

It's ironic in the same way that adding the text "DEFACING STOP SIGNS" under the main text of a stop sign is ironic.

The method used is the very one being condemned / warned against, and the fact that it works better than other methods (in both examples) only adds to the irony: one would assume that something which preaches against exactly what it's doing would invalidate itself, rather than, as actually happens, producing greater results due to a quirk of humans.

comment by someonewrongonthenet · 2014-07-14T02:54:06.085Z · LW(p) · GW(p)

I sort of side with Mitchell on this.

A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing because it's often better to do a followup experiment that rests on the premises established by the initial experiment. If the first experiment was wrong, the second experiment will end up wrong too. Science should not go even slower than it already does - just update and move on, don't obsess.

It's kind of like how some of the landmark studies on priming failed to replicate, but there are so many followup studies which are explained by priming really well that it seems a bit silly to throw out the notion of priming just because of that.

Keep in mind that while you are unlikely to hit statistical significance where there is no real effect, it's not at all unlikely for a real effect to miss significance the next time you run the experiment. Significance tests are tuned to produce false negatives more often than false positives.
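
To put rough numbers on that, here is a toy simulation; the effect size, sample size, and threshold are all assumed for illustration only:

```python
# A modest true effect (0.4 SD) with n = 25 per group "fails to replicate"
# most of the time at p < .05, even though the effect is perfectly real.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
effect, n, runs = 0.4, 25, 10_000

hits = sum(
    stats.ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue < 0.05
    for _ in range(runs)
)
print(f"power: {hits / runs:.0%}")  # around 28%: most replications "fail"
```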

Emotionally though... when you get a positive result in breast cancer screening even when you're not at risk, you don't just shrug and say "probably a false positive" even though it is. Instead, you irrationally do more screenings and possibly get a needless operation. Similarly, when the experiment fails to replicate, people don't shrug and say "probably a false negative", even though that is, in fact, very likely. Instead, they start questioning the reputation of the experimenter. Understandably, this whole process is nerve-wracking for the original experimenter. Which is, I think, what Mitchell was - admittedly clumsily - groping towards with the talk of "impugning scientific integrity".

Replies from: Viliam_Bur, gwern, Luke_A_Somers, Dan_Moore, DanielLC, Lumifer, eggman
comment by Viliam_Bur · 2014-07-14T06:33:14.940Z · LW(p) · GW(p)

If the first experiment was wrong, the second experiment will end up wrong too.

I guess the context is important here. If the first experiment was wrong, and the second experiment is wrong, will you publish the failure of the second experiment? Will you also publish your suspicion that the first experiment was wrong? How likely will people believe you that your results prove the first experiment was wrong, if you did something else?

Here is what the selection bias will do otherwise:

20 people will try 20 "second experiments" with p = 0.05. 19 of them will fail, one will succeed and publish the results of their successful second experiment. Then, using the same strategy, 20 people will try 20 "third experiments", and again, one of them will succeed... Ten years later, you can have a dozen experiments examining and confirming the theory from a dozen different angles, so the theory seems completely solid.
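
A back-of-the-envelope simulation of that selection process (the numbers of labs and generations are made up; only the mechanism matters):

```python
# Each "generation", 20 labs test a follow-up of a false theory at alpha = 0.05,
# and only the significant results get published.
import random

random.seed(1)
ALPHA, LABS, GENERATIONS = 0.05, 20, 12

confirmations = sum(
    any(random.random() < ALPHA for _ in range(LABS))
    for _ in range(GENERATIONS)
)
print(f"published 'confirmations' of a false theory: {confirmations}/{GENERATIONS}")
# P(at least one false positive per generation) = 1 - 0.95**20, about 64%,
# so the literature steadily accumulates what looks like independent support.
```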

It's kind of how some of the landmark studies on priming failed to replicate, but there are so many followup studies which are explained by priming really well that it seems a bit silly to throw out the notion of priming just because of that.

Is there a chance that the process I described was responsible for this?

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T18:13:21.672Z · LW(p) · GW(p)

I guess the context is important here. If the first experiment was wrong, and the second experiment is wrong, will you publish the failure of the second experiment? Will you also publish your suspicion that the first experiment was wrong? How likely will people believe you that your results prove the first experiment was wrong, if you did something else?

In practice, individual scientists like to be able to say "my work causes updates". If you do something that rests on someone else's work and the experiment doesn't come out, you have an incentive to say "Someonewrongonthenet's hypothesis X implies A and B. Someonewrongonthenet showed A [citation], but I tried B and that means X isn't completely right."

Cue further investigation which eventually tosses out X. Whether or not A was a false positive is less important than whether or not X is right.

Is there a chance that the process I described was responsible for this?

Yes, that's possible. I'm not sure direct replication actually solves that issue, though - you'd just shift over to favoring false negatives instead of false positives. The existing mechanism that works against this is the incentive to overturn other people's work.

comment by gwern · 2014-07-14T20:18:33.669Z · LW(p) · GW(p)

A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing because it's often better to do a followup experiment that rests on the premises established by the initial experiment. If the first experiment was wrong, the second experiment will end up wrong too. Science should not go even slower than it already does - just update and move on, don't obsess.

Tell me, does anyone actually do what you think they should do? That is, based on a long chain of ideas A->B->C->D, none of which have been replicated, upon experimenting and learning ~Z, do they ever reject the bogus theory D? (Or wait, was it C that should be rejected, or maybe the ~Z should be rejected as maybe the experiment just wasn't powered enough to be meaningful as almost all studies are underpowered or, can you really say that Z logically entailed A...D? Maybe some other factor interfered with Z and so we can 'save the appearances' of A..Z! Yes, that's definitely it!) "Theory-testing in psychology and physics: a methodological paradox", Meehl 1967, puts it nicely (and this is as true as the day he wrote it half a century ago):

This last methodological sin is especially tempting in the "soft" fields of (personality and social) psychology, where the profession highly rewards a kind of "cuteness" or "cleverness" in experimental design, such as a hitherto untried method for inducing a desired emotional state, or a particularly "subtle" gimmick for detecting its influence upon behavioral output. The methodological price paid for this highly-valued "cuteness" is, of course, (d) an unusual ease of escape from modus tollens refutation. For, the logical structure of the "cute" component typically involves use of complex and rather dubious auxiliary assumptions, which are required to mediate the original prediction and are therefore readily available as (genuinely) plausible "outs" when the prediction fails. It is not unusual that (e) this ad hoc challenging of auxiliary hypotheses is repeated in the course of a series of related experiments, in which the auxiliary hypothesis involved in Experiment 1 (and challenged ad hoc in order to avoid the latter's modus tollens impact on the theory) becomes the focus of interest in Experiment 2, which in turn utilizes further plausible but easily challenged auxiliary hypotheses, and so forth. In this fashion a zealous and clever investigator can slowly wend his way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of "an integrated research program", without ever once refuting or corroborating so much as a single strand of the network.

To give a concrete example of why your advice is absurd and impractical and dangerous...

One of the things I am most proud of is my work on dual n-back not increasing IQ; the core researchers, in particular the founder Jaeggi, are well aware that their results have not replicated very well and that the results are almost entirely explained by bad control groups, and this is in part thanks to increased sample size from various followup studies which tried to repeat the finding while doing something else, like an fMRI study or an emotional processing variant. So, what are they doing now, the Buschkuehl lab and the new Jaeggi lab? Have they abandoned DNB/IQ, reasoning that since "the first experiment was wrong, the second experiment will end up wrong too"? Have they taken your advice to "just update and move on, don't obsess"? Maybe taken serious stock of their methods and other results involving benefits to working memory training in general?

No. They are now busily investigating whether individual personality differences can explain transfer to IQ or not, whether other tasks can transfer, whether manipulating motivation can moderate transfer to IQ, and so on and so forth, and reaching p<0.05 and publishing papers just like they were before; but I suppose that's all OK, because after all, "there are so many followup studies which are explained by [dual n-back transferring] really well that it seems a bit silly to throw out the notion of [dual n-back increasing IQ] just because of that".

Replies from: someonewrongonthenet, ChristianKl
comment by someonewrongonthenet · 2014-08-28T18:43:50.544Z · LW(p) · GW(p)

Wait, I'm not sure we're talking about the same thing. I'm saying direct replication isn't the most useful way to spend time. You're talking about systematic experiment design flaws.

According to your writing, the failures in this example stem from methodological issues (not using an active control group). A direct replication of the n-back-IQ transfer would have just hit p<.05 again, as it would have had the same methodological issues. Of course, if the methodological issue is not repaired, all subsequent findings will suffer from the same issues.

I'm strictly saying that direct replication isn't useful. Rigorous checking of methods and doing it over again correctly where there is a failure in the documented methodology is always a good idea.

Replies from: gwern
comment by gwern · 2014-08-29T02:09:01.401Z · LW(p) · GW(p)

But the Jaeggi cluster also sometimes uses active control groups, with various kinds of differences in the intervention, metrics, and interpretations. In fact, Jaeggi was co-author on a new dual n-back meta-analysis released this month*; the meta-analysis finds the passive-active difference I did, and you know what their interpretation is? That it's due to the correlated classification of US vs international laboratories conducting particular experiments. (It never even occurred to me to classify the studies this way.) They note that sometimes psychology experiments reach different conclusions in other cultures/countries - which they do - so perhaps the lower results in American studies using active control groups are because Americans gain less from n-back training. The kindest thing I can say about this claim is that I may be able to falsify it with my larger collection of studies (they threw out or missed a lot).

So, after performing these conceptual extensions of their results - as you suggest - they continue to

...slowly wend [their] way through a tenuous nomological network, performing a long series of related experiments which appear to the uncritical reader as a fine example of "an integrated research program", without ever once refuting or corroborating so much as a single strand of the network.

So it goes.

* http://www.gwern.net/docs/dnb/2014-au.pdf / https://pdf.yt/d/VMPWmd0jpDYvZIjm / https://dl.dropboxusercontent.com/u/85192141/2014-au.pdf ; initial comments on it: https://groups.google.com/forum/#!topic/brain-training/GYqqSyfqffA

comment by ChristianKl · 2014-07-15T21:45:38.655Z · LW(p) · GW(p)

The first sentence in your dual-n-back article is:

I meta-analyze the >19 studies which measure IQ after an n-back intervention, confirming that there is a net gain of medium effect size.

If you believe that there's a net gain of medium effect size then why do you think we should throw dual n-back under the bus?

Replies from: gwern
comment by gwern · 2014-07-15T22:22:35.414Z · LW(p) · GW(p)

You should probably have read part of the second sentence: "active vs passive control groups criticism: found, and it accounts for most of the net effect size".

comment by Luke_A_Somers · 2014-07-15T15:43:43.998Z · LW(p) · GW(p)

If the first experiment was wrong, the second experiment will end up wrong too

This is not good, and I guess it is not what he meant.

You design the second experiment so that it aims to find something assuming the first was right, but if the first was wrong, it can expose that too. Basically, it has to be a stronger experiment than the first one.

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T18:48:38.382Z · LW(p) · GW(p)

Agreed, that is a better way to say what I was trying to say.

comment by Dan_Moore · 2014-07-16T14:17:25.408Z · LW(p) · GW(p)

A mentor of mine once told me that replication is useful, but not the most useful thing you could be doing because it's often better to do a followup experiment that rests on the premises established by the initial experiment. If the first experiment was wrong, the second experiment will end up wrong too. Science should not go even slower than it already does - just update and move on, don't obsess.

If you're concerned about the velocity of scientific progress, you should also be concerned about wrong turns. A Type 1 Error (establishing a wrong result by incorrectly rejecting a null hypothesis) is, IMHO, far more damaging to science than failure to establish a correct result - possibly due to an insufficient experimental setup.

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T19:00:41.099Z · LW(p) · GW(p)

Yeah, there's definitely an "exploration / rigor" trade-off here (or maybe "speed / accuracy") and I'm not sure it's clear which side we are erring on right now. I'm not terribly surprised that LW favors rigor, just due to the general personality profile of the users here, and that my favoring of exploration at the cost of being wrong a few times is in the minority.

I definitely think a rational agent would be more exploratory than science currently is, but on the other hand we've got systematic biases to contend with and rigor might offset that.

comment by DanielLC · 2014-07-15T23:28:20.871Z · LW(p) · GW(p)

Emotionally though... when you get a positive result in breast cancer screening even when you're not at risk, you don't just shrug and say "probably a false positive" even though it is. Instead, you irrationally do more screenings and possibly get a needless operation.

If you get a positive result, you run another test. If you keep getting positive results, you probably have breast cancer.

Similarly, if an experiment fails to replicate, you try again. If it replicates this time, then it's probably fine. If it keeps failing to replicate, then there's a problem.

At the very least, you need to try to replicate a random sample of studies, just to make sure there aren't more false studies than you've been assuming.

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T18:52:53.254Z · LW(p) · GW(p)

Not an expert on cancer, but I don't think it works that way. I think the cancer test accurately tests a variable which is a proxy for cancer risk. So a patient who doesn't have cancer but tests positive will continue testing positive, because the variable that the cancer test measures as a proxy for cancer is elevated in that patient.

Experiments do work that way, but I'm not arguing against that. I'm only arguing that direct replication isn't a better use of resources than just going on to a followup experiment with a different methodology (unless direct replication is really easy and you can just have some students do it or something).

Replies from: DanielLC
comment by DanielLC · 2014-08-28T19:01:10.078Z · LW(p) · GW(p)

Is there only one kind of test? Couldn't they find another proxy?

I'm only arguing that direct replication isn't a better use of resources than just going on to a followup experiment with a different methodology

If the followup is testing the same thing with a different methodology, then the metaphor works. If you run followup experiments just to find more detail, it would be like someone testing positive for cancer so then you run a test for what kind of cancer. You're assuming they have cancer when you run the second test, so the results could be misleading.

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T19:15:08.246Z · LW(p) · GW(p)

If the followup is testing the same thing with a different methodology, then the metaphor works.

Generally an idea is considered well supported when multiple methodologies support it, yes. In the psychology lab I used to work in, at least, we never try to replicate, but we do try to show the same thing in multiple different ways. There are maybe 15 different experiments a year, but they're generally all centered around proving or disproving a cluster of 2 or 3 broad, conceptually linked hypotheses.

Biology labs I've worked with do often do the whole "okay, the results are in and this is established now, let's find additional detail" thing, but that's because they were usually looking at much simpler systems, like a single protein or something, so they could afford to take liberties and not be so paranoid about experimental methods.

comment by Lumifer · 2014-08-28T19:24:27.707Z · LW(p) · GW(p)

If the first experiment was wrong, the second experiment will end up wrong too.

...and now you have two problems X-)

Science should not go even slower than it already does

It's not a matter of speed, it's a matter of velocity. Going fast in the wrong direction is (much) worse than useless.

you are unlikely to hit statistical significance where there is no real effect

You are quite likely. You start with a 5% chance under ideal circumstances and that chance only climbs from there. P-hacking is very widespread.
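
For what it's worth, here is a sketch of one common p-hacking mechanism, optional stopping; the batch size and stopping rule are assumed purely for illustration:

```python
# Peek at the data after every 10 subjects per group and stop at the first
# p < .05. There is no true effect, so an honest single test errs 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
runs, batch, max_n = 2_000, 10, 100

false_positives = 0
for _ in range(runs):
    a, b = [], []
    while len(a) < max_n:
        a.extend(rng.normal(0, 1, batch))
        b.extend(rng.normal(0, 1, batch))
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1  # stop and "publish" at the first significant peek
            break

print(f"false positive rate: {false_positives / runs:.0%}")  # well above the nominal 5%
```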

Instead, you irrationally do more screenings

8-0 You think getting additional screenings after testing positive for cancer is "irrational"??

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T19:53:56.375Z · LW(p) · GW(p)

The process of screening itself involves risks, not to mention the misplaced stress and possibility of unnecessary surgery.

Replies from: Lumifer
comment by Lumifer · 2014-08-28T21:17:04.682Z · LW(p) · GW(p)

This is true for e.g. any visit to the doctor. Are you saying that it's irrational to go for medical checkups?

In the cancer screening case, what do you think the cost-benefit analysis says?

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-28T21:55:04.963Z · LW(p) · GW(p)

It would be irrational to go for medical checkups when they aren't necessary - if you did it every 3 days, for example.

I'm looking at this from a birds eye view. A lot of people get unnecessary screenings, which give them information that is not worth acting upon whether the result is positive or negative, and then start worrying and getting unnecessary testing and treatment. Information is only useful to the extent that you can act upon it.

Replies from: Lumifer
comment by Lumifer · 2014-08-29T06:01:51.630Z · LW(p) · GW(p)

I'm looking at this from a birds eye view.

And from up there you take it upon yourself to judge whether personal decisions are rational or not? I think you're way too far away for that.

A lot of people get unnecessary screenings

That's a different issue. In a post upstream you made a rather amazing claim that additional tests after testing positive for cancer on a screening would be irrational. Do you stand by that claim?

Replies from: someonewrongonthenet
comment by someonewrongonthenet · 2014-08-30T15:22:06.448Z · LW(p) · GW(p)

And from up there you take it upon yourself to judge whether personal decisions are rational or not? I think you're way too far away for that.

Er...I think that's a little harsh of you. Overscreening is recognized as a problem among epidemiologists. When I say overscreening is a problem, I'm mostly just trusting expert consensus on the matter.

That's a different issue. In a post upstream you made a rather amazing claim that additional tests after testing positive for cancer on a screening would be irrational. Do you stand by that claim?

I stand by that a lot of smart people who study this issue believe that in actual medical practice, these screenings are either a problem in themselves, or that the information from the screenings can lead people to irrational behavior, and I do trust them.

But really, that was just an illustrative example used to steelman Mitchell. You don't have to accept the actual example, just the general concept that this sort of thing can happen.

Replies from: Lumifer
comment by Lumifer · 2014-09-02T18:38:45.148Z · LW(p) · GW(p)

Overscreening is recognized as a problem among epidemiologists.

Rationality does not specify values. I rather suspect that the cost-benefit analysis that epidemiologists look at is quite different from the cost-benefit analysis that individuals look at.

these screenings are either a problem in themselves, or that the information from the screenings can lead people to irrational behavior

LOL. Don't bother your pretty little head with too much information. No, you don't need to know that. No, you can't decide what you need to know and what you don't need to know. X-/

comment by eggman · 2014-07-14T15:42:01.579Z · LW(p) · GW(p)

Scientists, as a community of humans, should expect their research to return false positives sometimes, because that is what is going to happen, and they should publish those results. Scientists should also expect experiments to demonstrate that some of their hypotheses are just plain wrong. It seems to me replication is only not very useful if the replications of an experiment are likely prone to all the same problems that currently make original experiments in social psychology unreliable. I don't have experience or practical knowledge of the field, though, so I wouldn't know.

comment by James_Miller · 2014-07-13T21:30:04.537Z · LW(p) · GW(p)

When natural scientists attempt to replicate famous experiments where the original result was clearly correct, with what probability do they tend to succeed? Is it closer to 1 than, say, .7?

Replies from: Costanza, satt
comment by Costanza · 2014-07-13T21:51:40.012Z · LW(p) · GW(p)

I'd think that "famous experiments where the original result was clearly correct" are exactly those whose results have already been replicated repeatedly. If they haven't been replicated they may well be famous -- Stanford prison experiment, I'm looking at you -- but they aren't clearly correct.

Replies from: James_Miller
comment by James_Miller · 2014-07-14T01:47:42.305Z · LW(p) · GW(p)

I was thinking more "What is the error rate in replication experiments when we know the results from the original experiment were correct?" So if mixing X and Y under certain conditions has to yield Z, how often when scientists actually try to do this do they get Z?

Replies from: Lightwave
comment by Lightwave · 2014-07-14T09:20:39.017Z · LW(p) · GW(p)

The error rate in replication experiments in the natural sciences is expected to be much, much lower than in the social sciences. Humans and human environments are noisy and complicated. Look at nutrition/medicine - it's taking us decades to figure out whether some substance/food is good or bad for you and under what circumstances. Why would you expect it to be easier to analyze human psychology and behavior?

Replies from: ChristianKl
comment by ChristianKl · 2014-07-14T13:31:01.618Z · LW(p) · GW(p)

If you want to know whether food is good or bad, you have to look at mortality, which means you might have to wait a decade.

A lot of psychology experiments claim effects over much shorter timeframes.

Replies from: LimberLarry
comment by LimberLarry · 2014-07-17T08:48:41.297Z · LW(p) · GW(p)

I think he is more suggesting that the number of confounding factors in psychology experiments is generally far higher than in the natural sciences. The addition of such uncontrollable factors leads to a generally higher error rate in human sciences.

Replies from: ChristianKl
comment by ChristianKl · 2014-07-17T10:02:14.822Z · LW(p) · GW(p)

The number of confounding factors isn't that important if it's possible to do controlled experiments that control for them. Nutrition science has the problem that you usually can't do good controlled experiments, or those you can do are very expensive.

Replies from: LimberLarry
comment by LimberLarry · 2014-07-18T06:01:18.368Z · LW(p) · GW(p)

Obviously if you can control for a confounding factor then it's not an issue; I was simply stressing that the nature of the human sciences means that it is effectively impossible to control for all confounding factors, or even to be aware of many of them.

Replies from: ChristianKl
comment by ChristianKl · 2014-07-18T07:30:03.855Z · LW(p) · GW(p)

To the extent that's true, careful replication of studies to identify such factors is important if you don't want to practice what Feynman described as Cargo Cult science. If you follow Feynman's argument, physicists would also get a bunch of bad results if they worked with the scientific standards used in psychology.

Feynman on rat psychology:

All experiments in psychology are not of this type, however. For example, there have been many experiments running rats through all kinds of mazes, and so on--with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and still the rats could tell.

He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.

Now, from a scientific standpoint, that is an A-number-one experiment. That is the experiment that makes rat-running experiments sensible, because it uncovers the clues that the rat is really using -- not what you think it's using. And that is the experiment that tells exactly what conditions you have to use in order to be careful and control everything in an experiment with rat-running.

I looked up the subsequent history of this research. The next experiment, and the one after that, never referred to Mr. Young. They never used any of his criteria of putting the corridor on sand, or being very careful. They just went right on running the rats in the same old way, and paid no attention to the great discoveries of Mr. Young, and his papers are not referred to, because he didn't discover anything about the rats.

Nutrition is really a different case than a lot of psychology. There are questions in psychology, such as whether doing certain things to a child in its childhood affects whether that child becomes a healthy adult. Those questions are hard to investigate scientifically because of the time lag. The same isn't true for many psychology experiments.

Replies from: LimberLarry
comment by LimberLarry · 2014-07-18T10:10:00.649Z · LW(p) · GW(p)

I don't think we actually disagree on anything, the only point I was making was that your reply to Lightwave, while accurate, wasn't actually replying to the point he made.

Replies from: ChristianKl
comment by ChristianKl · 2014-07-18T12:20:52.954Z · LW(p) · GW(p)

I did reply to his point. He spoke about nutrition science. That field has its own problems that psychologists don't have to deal with. It's a bad example if he wanted to make the point you think he wanted to make.

Replies from: LimberLarry
comment by LimberLarry · 2014-07-18T22:40:35.100Z · LW(p) · GW(p)

Well fair enough. His use of nutrition science as an example was probably poorly chosen.

comment by satt · 2014-07-16T03:41:55.539Z · LW(p) · GW(p)

I've suggested on LW before that most attempts at physics experiments are wrong, if one counts physics students' attempts. The standard reaction to a student getting a counterintuitive result is, "well, obviously they messed up the experiment". I notice I feel OK with that response in the case of physics but don't like Mitchell trying it for psychology.

(I wonder whether biology students have to count chromosomes.)

Replies from: Pfft
comment by Pfft · 2014-07-16T16:32:50.436Z · LW(p) · GW(p)

Students are particularly bad at experimentation (which is why they have to take those labs in the first place), and the experiments they do are selected for being particularly fundamental and well-understood (in particular, they have already been replicated lots of times). I think this is a more important difference than physics versus psychology.

comment by jsteinhardt · 2014-07-13T23:29:38.260Z · LW(p) · GW(p)

Rather, the problem is that at least one celebrated authority in the field hates that, and would prefer much, much more deference to authority.

I don't think this is true at all. His points against replicability are very valid and match my experience as a researcher. In particular:

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way.

This is a very real issue and I think that if we want to solve the current issues with science we need to be honest about this, rather than close our eyes and repeat the mantra that replication will solve everything. And it's not like he's arguing against accountability. Even in your quoted passage he says:

The field of social psychology can be improved, but not by the publication of negative findings. Experimenters should be encouraged to restrict their “degrees of freedom,” for example, by specifying designs in advance.

Now, I think he goes too far by saying that no negative findings should be published; but I think they need to be held to a high standard for the very reason he gives. On the other hand, positive findings should also be held to a higher standard.

Note that there are people much wiser than me (such as Andrew Gelman) who disagree with me; Gelman is dissatisfied with the current presumption that published research is correct. I certainly agree with this but for the same reasons that Mitchell gives, I don't think that merely publishing negative results can fix this issue.

Either way, I think you are being quite uncharitable to Mitchell.

Replies from: Costanza, ThisSpaceAvailable, Vaniver, CarlShulman
comment by Costanza · 2014-07-14T00:14:35.335Z · LW(p) · GW(p)

What is the purpose of an experiment in science? For instance, in the field of social psychology? For instance, what is the current value of the Milgram experiment? A few people in Connecticut did something in a room at Yale in 1961. Who cares? Maybe it's just gossip from half a century ago.

However, some people would have us believe that this experiment has broader significance, beyond the strict parameters of the original experiment, and has implications for (for example) the military in Texas and corporations in California.

Maybe these people are wrong. Maybe the Milgram experiment was a one-off fluke. If so, then let's stop mentioning it in every intro to psych textbook. While we're at it, why the hell was that experiment funded, anyway? Why should we bother funding any further social psychology experiments?

I would have thought, though, that most social psychologists would believe that the Milgram experiment has predictive significance for the real world. A Bayesian who knows about the results of the Milgram experiment should better be able to anticipate what happens in the real world. This is what an experiment is for. It changes your expectations.

However, if a supposedly scientific experiment does nothing at all to alter your expectations, it has told you nothing. You are just as ignorant as you were before the experiment. It was a waste.

Social psychology purports to predict what will happen in the real world. This is what would qualify it as a science. Jason Mitchell is saying it cannot even predict what will happen in a replicated experiment. In so doing, he is proclaiming to the world that he personally has learned nothing from the experiments of social psychology. He is ignorant of what will happen if the experiment is replicated. I am not being uncharitable to Mitchell. He is rejecting the foundations of his own field. He is not a scientist.

Replies from: fubarobfusco
comment by fubarobfusco · 2014-07-14T07:28:25.711Z · LW(p) · GW(p)

Maybe the Milgram experiment was a one-off fluke.

http://en.wikipedia.org/wiki/Milgram_experiment#Replications_and_variations

In Obedience to Authority: An Experimental View (1974), Milgram describes nineteen variations of his experiment [...]

(emphasis added)

comment by ThisSpaceAvailable · 2014-07-14T07:23:31.548Z · LW(p) · GW(p)

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way.

This is a very real issue and I think that if we want to solve the current issues with science we need to be honest about this, rather than close our eyes and repeat the mantra that replication will solve everything.

Why is it more likely that the followup experiment was flawed, rather than the original? Are we giving a prior of > 50% to every hypothesis that a social scientist comes up with?

comment by Vaniver · 2014-07-14T00:47:38.774Z · LW(p) · GW(p)

Either way, I think you are being quite uncharitable to Mitchell.

I disagree. Let's look at this section again:

Whether they mean to or not, authors and editors of failed replications are publicly impugning the scientific integrity of their colleagues. Targets of failed replications are justifiably upset, particularly given the inadequate basis for replicators’ extraordinary claims.

Contrast this to:

“This has been difficult for me personally because it’s an area that’s important for my research,” he says. “But I choose the red pill. That’s what doing science is.”

From here, linked before on LW here.

The first view seems to have the implied assumption that false positives don't happen to good researchers, whereas the second view has the implied assumption that theories and people are separate, and people should follow the facts, rather than the other way around.


But perhaps it is the case that, in social psychology, the majority of false positives are not innocent, and thus when a researcher's results do not replicate, it is a sign that they're dishonest rather than that they're unlucky. In such a case, he is declaring that researchers should not try to expose dishonesty, which should bring down opprobrium from all decent people.

Replies from: Dan_Moore
comment by Dan_Moore · 2014-07-14T13:43:24.412Z · LW(p) · GW(p)

The goal is to set up the experiments to make it solely about the results and not about colleagues. If 'scientific integrity' means sloppy, porous experimental setup, then impugning this is not a bad thing. Ideally the experimental design and execution should transcend the question of the researchers' motives.

comment by CarlShulman · 2014-07-14T18:21:51.458Z · LW(p) · GW(p)

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way

Do you agree with the empirical claim about the frequencies of false positives in initial studies versus false negatives in replications?

comment by Aleksander · 2014-07-14T02:11:11.919Z · LW(p) · GW(p)

This is why we can't have social science. Not because the subject is not amenable to the scientific method -- it obviously is. People are conducting controlled experiments and other people are attempting to replicate the results. So far, so good.

So, you say people are trying the scientific approach. My guess is, the nature of the problem is such that nothing much came out of these attempts. No great insights were gained, no theories were discovered. Real scientists had nothing to show for their efforts, and this is why these fields are now owned not by real scientists, but by people with other skills.

For an imperfect analogy, say an ancient civilization left a lot of incomprehensible texts. Say a lot of effort went into deciphering these texts, and to everyone's surprise it was discovered that some of these texts were English encrypted with some cyphers that the cryptographers cracked. So people skilled in the mathematics of cryptography started working on other texts, and many of them were consequently decrypted. But there were some classes of documents that the cryptographers were helpless against. Maybe these texts were really random garbage, or maybe they were encrypted with cyphers which are well beyond the current reach of our technology.

Now imagine a crook comes and says he has found the true meaning of a text. He calls himself a cryptographer and produces an "explanation" which is drivel but which superficially sounds like what the cryptographers say. If he tries that with one of the texts that have been decrypted for real, it will be hard for him to compete with the actual solution. But for the texts that remain a mystery, crooks will be the only game in town.

Replies from: Nornagest
comment by Pfft · 2014-07-16T16:46:22.554Z · LW(p) · GW(p)

I think someone should mention Harry Collins and Trevor Pinch's book The Golem here. It's a collection of episodes from the history of science. The general theme is that in practice, new discoveries do not involve a clear-cut observation followed by theorizing; instead there is a lot of squabbling over whether the researchers involved carried out their experiments correctly, and these kinds of feuds can persist for a scientific generation.

My view is that this makes replication attempts all the more important. But it also shows that some resistance, recrimination, and mudslinging are probably to be expected -- because the human status game of "are you a good enough scientist that we can really trust you" is a very integral part of how humans do science.

comment by MrCogmor · 2014-07-14T09:38:50.872Z · LW(p) · GW(p)

In the second paragraph of the quote the author ignores the whole point of replication efforts. We know that scientific studies may suffer from methodological errors. The whole point of replication studies is to identify methodological errors. If the studies disagree, then you know there is an uncontrolled variable or a methodological mistake in one or both of them; further studies and the credibility of the experimenters are then used to determine which result is more likely to be true. If the independent studies agree, then that is evidence that they are both correct.

The author also argues that replication efforts are biased because they are mostly made by people who disagree with the original study. That seems like a valid point.

Specifying designs in advance is a good idea, though not original.

Replies from: DanArmak
comment by DanArmak · 2014-07-14T19:25:46.665Z · LW(p) · GW(p)

The author also argues that replication efforts are biased because they are mostly made by people who disagree with the original study. That seems like a valid point.

Original, non-replication studies are mostly made by people who agree with what their studies are showing. (Also, publication bias.) So this is not a reason to think replication studies are particularly biased.

comment by HungryHobo · 2014-07-15T16:24:56.597Z · LW(p) · GW(p)

While I agree that this guy needs to hand in his "Scientist" card, this is an individual who no more reflects on his field than any other individual does on theirs.

There was a notable climate scientist whose response to people asking for his data was literally "no, you'll just try to use it to prove me wrong".

Edit: exact quote:"Even if WMO agrees, I will still not pass on the data. We have 25 or so years invested in the work. Why should I make the data available to you, when your aim is to try and find something wrong with it." http://www.theregister.co.uk/2009/08/13/cru_missing/

That doesn't make all climate scientists kooks any more than this guy makes all social scientists kooks.

There will always be people who manage to get research grants despite not really getting the whole point of science.

People who view being "right" as more important than having the correct answer.

People who respond to having authority by deciding that questioning authority, questioning them, is wrong and bad.

Replies from: waveman
comment by waveman · 2014-12-03T04:45:19.339Z · LW(p) · GW(p)

There was a notable climate scientist who's response to people asking for hist data was literally "no, you'll just try to use it to prove me wrong".

You were down-voted in part because of bad grammar (who's=>whose) and the absence of citations.

Replies from: HungryHobo
comment by HungryHobo · 2014-12-03T15:44:08.374Z · LW(p) · GW(p)

edited