How do you assess the quality / reliability of a scientific study?

post by Eli Tyre (elityre) · 2019-10-29T14:52:57.904Z · LW · GW · 12 comments

This is a question post.

Contents

  Answers
    53 waveman
    31 Unnamed
    27 Bucky
    26 jimrandomh
    21 Elizabeth
    17 Kristin Lindquist
    12 romeostevensit
    11 Charlie Steiner
    9 whales
    6 Elizabeth
    4 TheMajor
    3 Mary Chernyshenko
12 comments

When you look at a paper, what signs cause you to take it seriously? What signs cause you to discard the study as too poorly designed to be much evidence one way or the other?

I'm hoping to compile a repository of heuristics on study evaluation, and would love to hear people's tips and tricks, or their full evaluation-process.

I'm looking for things like...



Answers

answer by waveman · 2019-10-30T02:09:40.846Z · LW(p) · GW(p)

I've probably read about 1000 papers. Lessons learned the hard way...

1. Look at the sponsorship of the research and of the researchers (previous sponsorship, "consultancies" etc. are also important for up to 10-15 years). This creates massive bias. E.g., a lot of medical bodies and researchers are owned by pharmaceutical companies.

2. Look at ideological biases of the authors. E.g. a lot of social science research assumes as a given that genes have no effect on personality or intelligence. (Yes, really).

3. Understand statistics very deeply. There is no pain-free way to get this knowledge, but without it you cannot win here. E.g. a) The assumptions behind all the statistical models b) the limitations of alleged "corrections". You need to understand both Bayesian and Frequentist statistics in depth, to the point that they are obvious and intuitive to you.

4. Understand how researchers rig results, e.g. undisclosed multiple comparisons, peeking at the data before deciding what analysis to do, failing to pre-publish the design and end points and to follow that pre-publication, "run-in periods" for drug trials, sponsor-controlled committees to review and change diagnoses... There are papers about this, e.g. "Why Most Published Research Findings Are False".

5. After sponsorship, read the methods section carefully. Look for problems. Have valid and appropriate statistics been used? Were the logical end points assessed? Maybe then look at the conclusions. Do the conclusions match the body of the paper? Has the data from the study been made available to all qualified researchers to check the analysis? Things can change a lot when that happens, e.g. Tamiflu. If the data is only available to commercial interests and their stooges, this is a bad sign.

6. Has the study been replicated by independent researchers?

7. Is the study observational? If so, does it meet generally accepted criteria for valid observational studies? (large effect, dose-response gradient, well understood causal model, well understood confounders, confounders smaller than the published effect, etc.)

8. Do not think you can read abstracts only and learn much that is useful.

9. Read some of the vitriolic books about the problems in research, e.g. "Deadly Medicines and Organised Crime: How Big Pharma Has Corrupted Healthcare" by Peter C. Gøtzsche. Not everything in this book is true but it will open your eyes about what can happen.

10. Face up to the fact that 80-90% of studies are useless or wrong. You will spend a lot of time reading things only to conclude that there is not much there.



comment by waveman · 2019-10-30T02:10:40.938Z · LW(p) · GW(p)

One of the most miserable things about the LW experience is realizing how little you actually know with confidence.

comment by Ruby · 2019-10-30T05:46:17.605Z · LW(p) · GW(p)
I've probably read about 1000 papers. Lessons learned the hard way...

Very cool. How have these been split across different fields/domains?

Replies from: waveman
comment by waveman · 2019-12-03T03:17:16.580Z · LW(p) · GW(p)

Mostly medicine, nutrition, metabolism. Also finance and economics.

comment by ChristianKl · 2019-11-02T11:59:13.794Z · LW(p) · GW(p)

What kind of experiences were the hard lessons? What did the moments of learning look like?

Replies from: waveman
comment by waveman · 2019-12-03T03:17:58.786Z · LW(p) · GW(p)

Mostly belatedly realizing that studies I took as Gospel turned out to be wrong. This triggered an intense desire to know why and how.

comment by iterativecode · 2019-10-31T14:29:18.768Z · LW(p) · GW(p)

This is a great answer and should be taught to everyone.

comment by Alex K. Chen (parrot) (alex-k-chen) · 2020-11-11T04:39:26.232Z · LW(p) · GW(p)

Is there an online way to better tag which studies are suspect and which ones aren't - for the sake of everyone else who reads after?

Replies from: Ruby, alexey-lapitsky
comment by Ruby · 2020-11-12T23:56:40.840Z · LW(p) · GW(p)

Check out PubPeer.

comment by Alexey Lapitsky (alexey-lapitsky) · 2020-11-11T20:48:30.430Z · LW(p) · GW(p)

I am using https://scite.ai/ with a plugin for browsers, but I would love a similar service with user-generated flags.

comment by Mary Chernyshenko (mary-chernyshenko) · 2019-12-02T17:23:45.957Z · LW(p) · GW(p)

(a minor thing - I used to have a separate MSWord file with a table for "techniques". Some people prefer Excel and so on, but I find that Word helps me keep it laconic. The columns were: Species; Purpose; Fixation/Storage; Treatment; and Reference (with a hyperlink). Within Treatment I just highlighted specific terms. Very easy to see something out of the ordinary.)

answer by Unnamed · 2019-11-12T22:38:05.665Z · LW(p) · GW(p)

Context: My experience is primarily with psychology papers (heuristics & biases, social psych, and similar areas), and it seems to generalize pretty well to other social science research and fields with similar sorts of methods.


One way to think about this is to break it into three main questions:

1. Is this "result" just noise? Or would it replicate?

2. (If there's something besides noise) Is there anything interesting going on here? Or are all the "effects" just confounds, statistical artifacts, demonstrating the obvious, etc.

3. (If there is something interesting going on here) What is going on here? What's the main takeaway? What can we learn from this? Does it support the claim that some people are tempted to use it to support?

There is some benefit just to explicitly considering all three questions, and keeping them separate.


For #1 ("Is this just noise?") people apparently do a pretty good job of predicting which studies will replicate. Relevant factors include:

1a. How strong is the empirical result (tiny p value, large sample size, precise estimate of effect size, etc.).

1b. How plausible is this effect on priors? Including: How big an effect size would you expect on priors? And: How definitively does the researchers' theory predict this particular empirical result?

1c. Experimenter degrees of freedom / garden of forking paths / possibility of p-hacking. Preregistration is best, visible signs of p-hacking are worst.

1d. How filtered is this evidence? How much publication bias?

1e. How much do I trust the researchers about things like (c) and (d)?

I've found that this post on how to think about whether a replication study "failed" also seems to have helped clarify my thinking about whether a study is likely to replicate.

If there are many studies of essentially the same phenomenon, then try to find the methodologically strongest few and focus mainly on those. (Rather than picking one study at random and dismissing the whole area of research if that study is bad, or assuming that just because there are lots of studies they must add up to solid evidence.)

If you care about effect size, it's also worth keeping in mind that the things which turn noise into "statistically significant results" also tend to inflate effect sizes.


For #2 ("Is there anything interesting going on here?"), understanding methodology & statistics is pretty central. Partly that's background knowledge & expertise that you keep building up over the years, partly that's taking the time & effort to sort out what's going on in this study (if you care about this study and can't sort it out quickly), sometimes you can find other writings which comment on the methodology of this study which can help a lot. You can try googling for criticisms of this particular study or line of research (or check google scholar for papers that have cited it), or google for criticisms of specific methods they used. It is often easier to recognize when someone makes a good argument than to come up with that argument yourself.

One framing that helps me think about a study's methodology (and whether or not there's anything interesting going on here) is to try to flesh out "null hypothesis world": in the world where nothing interesting is going on, what would I expect to see come out of this experimental process? Sometimes I'll come up with more than one world that feels like a null hypothesis world. Exercise: try that with this study (Egan, Santos, Bloom 2007). Another exercise: Try that with the hot hand effect.
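
To make "null hypothesis world" concrete, here is a minimal simulation sketch in Python (my addition, not from the original answer): two groups with no true difference between them, analyzed the way a typical study would analyze them. Running it shows what kinds of "results" this experimental process spits out when nothing interesting is going on.

```python
# "Null hypothesis world": two groups with no real difference, measured with noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, n_experiments = 30, 2000

significant = 0
effect_estimates = []
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)    # no true effect in either group
    treatment = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    effect_estimates.append(treatment.mean() - control.mean())
    if p < 0.05:
        significant += 1

print(f"false-positive rate: {significant / n_experiments:.3f}")   # ~0.05 by construction
print(f"typical |difference| seen: {np.mean(np.abs(effect_estimates)):.2f} SD")
```

Comparing a study's reported result against this kind of baseline is one way to notice when an "effect" looks like exactly what the null world would have produced anyway.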


#3 ("What is going on here?") is the biggest/broadest question of the three. It's the one that I spend the most time on (at least if the study is any good), and it's the one that I could most easily write a whole bunch about (making lots of points and elaborating on them). But it's also the one that is the most distant from Eli's original question, and I don't want to turn those post into a big huge essay, so I'll just highlight a few things here.

A big part of the challenge is thinking for yourself about what's going on and not being too anchored on how things are described by the authors (or the press release or the person who told you about the study). Some moves here:

3a. Imagine (using your inner sim) being a participant in the study, such that you can picture what each part of the study was like. In particular, be sure that you understand every experimental manipulation and measurement in concrete terms (okay, so then they filled out this questionnaire which asked if you agree with statements like such-and-such and blah-blah-blah).

3b. Be sure you can clearly state the pattern of results of the main finding, in a concrete way which is not laden with the authors' theory (e.g. not "this group was depleted" but "this group gave up on the puzzles sooner"). You need this plus 3a to understand what happened in the study, then from there you're trying to draw inferences about what the study implies.

3c. Come up with (one or several) possible models/theories about what could be happening in this study. Especially look for ones that seem commonsensical / that are based in how you'd inner sim yourself or other people in the experimental scenario. It's fine if you have a model that doesn't make a crisp prediction, or if you have a theory that seems a lot like the authors' theory (but without their jargon). Exercise: try that with a typical willpower depletion study.

3d. Have in mind the key takeaway of the study (e.g., the one sentence summary that you would tell a friend; this is the thing that's the main reason why you're interested in reading the study). Poke at that sentence to see if you understand what each piece of it means. As you're looking at the study, see if that key takeaway actually holds up. e.g., Does the main pattern of results match this takeaway or do they not quite match up? Does the study distinguish the various models that you've come up with well enough to strongly support this main takeaway? Can you edit the takeaway claim to make it more precise / to more clearly reflect what happened in the study / to make the specifics of the study unsurprising to someone who heard the takeaway? What sort of research would it take to provide really strong support for that takeaway, and how does the study at hand compare to that?

3e. Look for concrete points of reference outside of this study which resemble the sort of thing the researchers are talking about. Search in particular for ones that seem out-of-sync with this study. e.g., This study says not to tell other people your goals, but the other day I told Alex about something I wanted to do and that seemed useful; do the specifics of this experiment change my sense of whether that conversation with Alex was a good idea?


Some narrower points which don't neatly fit into my 3-category breakdown:

A. If you care about effect sizes then consider doing a Fermi estimate, or otherwise translating the effect size into numbers that are intuitively meaningful to you. Also think about the range of possible effect sizes rather than just the point estimate, and remember that the issues with noise in #1 also inflate effect size.

B. If the paper finds a null effect and claims that it's meaningful (e.g., that the intervention didn't help) then you do care about effect sizes. (e.g., If it claims the intervention failed because it had no effect on mortality rates, then you might assume a value of $10M per life and try to calculate a 95% confidence interval on the value of the intervention based solely on its effect on mortality.)
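
As a toy illustration of point B (my addition; the numbers are hypothetical, and only the $10M-per-life figure comes from the example above), here is how a "null" mortality result can translate into a very wide confidence interval on the intervention's value:

```python
# Toy calculation: a non-significant mortality effect still spans a huge range of value.
effect_per_1000 = -0.5        # hypothetical point estimate: change in deaths per 1000 treated
se_per_1000 = 1.0             # hypothetical standard error of that estimate
value_per_life = 10_000_000   # $10M per life, as in the example above

low = effect_per_1000 - 1.96 * se_per_1000
high = effect_per_1000 + 1.96 * se_per_1000
# Fewer deaths (negative effect) means positive value, so flip the sign.
print(f"95% CI on value per 1000 people treated: "
      f"${-high * value_per_life:,.0f} to ${-low * value_per_life:,.0f}")
```

A range that runs from substantially negative to substantially positive is a very different claim from "the intervention didn't help."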

C. New papers that claim to debunk an old finding are often right when they claim that the old finding has issues with #1 (it didn't replicate) or #2 (it had methodological flaws) but are rarely actually debunkings if they claim that the old finding has issues with #3 (it misdescribes what's really going on). The new study on #3 might be important and cause you to change your thinking in some ways, but it's generally an incremental update rather than a debunking. Examples that look to me like successful debunkings: behavioral social priming research (#1), the Dennis-dentist effect (#2), the hot hand fallacy (#2 and some of B), the Stanford Prison Experiment (closest to #2), various other things that didn't replicate (#1). Examples of alleged "debunkings" which seem like interesting but overhyped incremental research: the bystander effect [LW(p) · GW(p)] (#3), loss aversion (this study) (#3), the endowment effect (#3).

answer by Bucky · 2019-11-12T16:28:22.875Z · LW(p) · GW(p)

Often I want to form a quick impression as to whether it is worth me analysing a given paper in more detail. A couple of quick calculations can go a long way. Some of this will be obvious but I've tried to give the approximate thresholds for the results which up until now I've been using subconsciously. I'd be very interested to hear other people's thresholds.

Calculations

  • Calculate how many p-values (could) have been calculated.
    • If the study and analysis techniques were pre-registered then count how many p-values were calculated.
    • If the study was not pre-registered, calculate how many different p-values could have been calculated (had the data looked different) which would have been equally justified as the ones that they did calculate (see Gelman’s garden of forking paths). This depends on how aggressive any hacking has been but roughly speaking I'd calculate:
      • Number of input variables (including interactions) x Number of measurement variables
  • Calculate expected number of type I errors
    • Multiply answer from previous step by the threshold p-value of the paper
    • Different results may have different thresholds which makes life a little more complicated

  • Estimate Cohen’s d for the experiment (without looking at the actual result!)
    • One option in estimating effect size is to not consider the specific intervention, but just to estimate how easy the target variable is to move for any intervention – see putanumonit for a more detailed explanation. I wouldn't completely throw away my prior on how effective the particular intervention in question is, but I do consider it helpful advice to not let my prior act too powerfully.
  • Calculate experimental power
    • You can calculate this properly but alternatively can use Lehr’s formula. Sample size equations for different underlying distributions can be found here.
    • To get Power > 0.8 we require a sample size per group of roughly n ≈ 16 / d²
      • This is based on α = 0.05, a single p-value calculated, 2 samples of equal size, 2-tailed t-test.
    • A modification to this rule to account for multiple p-values would be to add 3.25 to the numerator for each doubling of the number of p-values calculated previously (see the sketch after this list).
    • If sample sizes are very unequal (ratio of >10) then the number required in the smaller sample is the above calculation divided by 2. This also works for single sample tests against a fixed value.
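
Here is a rough sketch of the screening calculations above (my rendering of these rules of thumb in Python, not code from the original answer; the example numbers are made up):

```python
import math

def expected_type_i_errors(n_possible_p_values: int, alpha: float = 0.05) -> float:
    """Expected number of false positives if every (potential) test is run at `alpha`."""
    return n_possible_p_values * alpha

def lehr_n_per_group(cohens_d: float, n_possible_p_values: int = 1) -> float:
    """Lehr's rule of thumb: n ≈ 16 / d^2 per group for power 0.8 at alpha = 0.05,
    adding ~3.25 to the numerator per doubling of the number of p-values."""
    numerator = 16 + 3.25 * math.log2(max(n_possible_p_values, 1))
    return numerator / cohens_d ** 2

# Example: a non-preregistered study with 8 input variables and 1 outcome,
# and an assumed effect size of d = 0.3.
print(expected_type_i_errors(8))      # 0.4 -> above the 0.25 "write off" threshold below
print(lehr_n_per_group(0.3, 8))       # ~286 per group needed for power > 0.8
```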

Thresholds

Roughly speaking, if the expected number of type I errors is above 0.25 I'll write the study off; between 0.05 and 0.25 I'll be suspicious. If multiple significant p-values are found this gets a bit tricky due to non-independence of the p-values, so more investigation may be required.

If sample size is sufficient for power > 0.8 then I'm happy. If it comes out below then I'm suspicious and have to check whether my estimate for Cohen's d is reasonable. If I'm still convinced N is a long way from being large enough, I'll write the study off. Obviously, as the paper has been published, the calculated Cohen's d is large enough to get a significant result; the question is whether I believe that the calculated effect size is reasonable.

Test

I tried Lehr’s formula on the 80,000 hours replication quiz. Of the 21 replications, my calculation gave a decisive answer in 17 papers, getting them all correct - 9 studies with comfortably oversized samples replicated successfully, 8 studies with massively undersized samples (less than half the required sample size I calculated) failed to replicate. Of the remaining 4 where the sample sizes were 0.5 – 1.2 x my estimate from Lehr’s equation, all successfully replicated.

(I remembered the answer to most of the replications but tried my hardest to ignore this when estimating Cohen's d.)

Just having a fixed minimum N wouldn’t have worked nearly as well – of the 5 smallest studies only 1 failed to replicate.

comment by Bucky · 2019-11-18T12:18:17.497Z · LW(p) · GW(p)

I just came across an example of this which might be helpful.

Good grades and a desk 'key for university hopes' (BBC News)

Essentially getting good grades and having a desk in your room are apparently good predictors of whether you want to go to university or not. The former seemed sensible, the latter seemed like it shouldn't have a big effect size but I wanted to give it a chance.

The paper itself is here.

Just from the abstract you can tell there are at least 8 input variables so the numerator on Lehr's equation becomes ~26. This means a cohen's d of 0.1 (which I feel is pretty generous for having a desk in your room) would require 2600 results in each sample.

As the samples are unlikely to be of equal size, I would estimate they would need a total of ~10,000 samples for this to have any chance of finding a meaningful result for smaller effect sizes.

The actual number of samples was ~1,000. At this point I would normally write off the study without bothering to go deeper, the process taking less than 5 minutes.

I was curious to see how they managed to get multiple significant results despite the sample size limitations. It turns out that they decided against reporting p-values because "we could no longer assume randomness of the sample". Instead they report the odds ratio of each result and said that anything with a large ratio had an effect, ignoring any uncertainty of the results.

It turns out there were only 108 students in the no-desk sample. Definitely what Andrew Gelman calls a Kangaroo measurement.

There are a lot of other problems with the paper but just looking at the sample size (even though the sample size was ~1,000) was a helpful check to confidently reject the paper with minimal effort.

comment by Bucky · 2019-11-18T16:27:01.588Z · LW(p) · GW(p)

Additional thoughts:

1. For reasonable assumptions, if you're studying an interaction then you might need 16x larger samples - see Gelman. Essentially the standard error is double for interactions, and Andrew thinks that interaction effects being half the size of main effects is a good starting point for estimates, giving (2 × 2)² = 16 times larger samples.

2. When estimating Cohen's d, it is important that you know whether the study is between or within subjects - within subject studies will give much lower standard error and thus require much smaller samples. Again, Gelman discusses this.

answer by jimrandomh · 2019-11-02T06:25:58.214Z · LW(p) · GW(p)

1. For health-related research, one of the main failure modes I've observed when people I know try to do this, is tunnel vision and a lack of priors about what's common and relevant. Reading raw research papers before you've read broad-overview stuff will make this worse, so read UpToDate first and Wikipedia second. If you must read raw research papers, find them with PubMed, but do this only rarely and only with a specific question in mind.

2. Before looking at the study itself, check how you got there. If you arrived via a search engine query that asked a question or posed a topic without presupposing an answer, that's good; if there are multiple studies that say different things, you've sampled one of them at random. If you arrived via a query that asked for confirmation of a hypothesis, that's bad; if there are multiple studies that said different things, you've sampled in a way that was biased towards that hypothesis. If you arrived via a news article, that's the worst; if there are multiple studies that said different things, you sampled in a way that was biased opposite reality.

3. Don't bother with studies in rodents, animals smaller than rodents, cell cultures, or undergraduate psychology students. These studies are done in great numbers because they are cheap, but they have low average quality. The fact that they are so numerous makes the search-sampling problems in (2) more severe.

4. Think about what a sensible endpoint or metric would be before you look at what endpoint/metric was reported. If the reported metric is not the metric you expected, this will often be because the relevant metric was terrible. Classic examples are papers about battery technologies reporting power rather than capacity, biomedical papers reporting effects on biomarkers rather than symptoms or mortality.

5. Correctly controlling for confounders is much, much harder than people typically give it credit for. Adding extra things to the list of things controlled for can create spurious correlations, and study authors are not incentivized to handle this correctly. The practical upshot is that observational studies only count if the effect size is very large.
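
A minimal sketch of one way this goes wrong (my illustration, not jimrandomh's example): "controlling for" a collider, a variable caused by both the exposure and the outcome, manufactures an association where none exists.

```python
# Collider bias: x and y are independent, but both influence c. Conditioning on c
# (here by stratifying on high c, a crude stand-in for regression adjustment)
# induces a correlation between x and y that is absent in the full data.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
y = rng.normal(size=n)            # independent of x by construction
c = x + y + rng.normal(size=n)    # the collider

print(np.corrcoef(x, y)[0, 1])                 # ~0.00 in the full sample
high_c = c > 1.0                               # "control for" c by selecting on it
print(np.corrcoef(x[high_c], y[high_c])[0, 1]) # ~ -0.33: a spurious correlation
```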

answer by Elizabeth · 2019-11-10T04:14:30.529Z · LW(p) · GW(p)

One tactic I like to use is "how do they know this?", and asking myself or investigating if it's possible for their answer to demonstrate the thing they're claiming.

A lot of work doesn't tell you. That work isn't necessarily wrong, because the authors might have a good answer they're not incentivized to share, but at a minimum it's going to make it hard to learn from the work.

A lot of work claims to tell you, but when you look they are lying. For example, when I investigated the claim that humans can do 4 hours of thought-work per day, I looked up the paper's citations and found they referred to experiments on busywork. Even if those studies were valid, they couldn't possibly prove anything about thought-work. I consider "pretending to have sources and reasons" a worse sin than "not giving a source or reason".

More ambiguously, I spent a lot of time trying to figure out how much we could tell, and at what resolution, from ice core data. I still don't have a great answer on this for the time period I was interested in. But I learned enough to know that the book I was reading (The Fate of Rome) was presenting the data as more clear-cut than it was.

On the other end, The Fall of Rome spends a lot of time explaining why pottery is useful in establishing the economic and especially trade status of an area/era. This was pretty hard to verify from external sources because it's original research from the author, but it absolutely makes sense and produces a lot of claims and predictions that could be disproved. Moreover, none of the criticism I found of Fall of Rome addressed his points on pottery - no one was saying "well I looked at Roman pottery and think the quality stayed constant through the 600s".

comment by Eli Tyre (elityre) · 2019-11-10T05:07:27.358Z · LW(p) · GW(p)

Thanks.

This point in particular sticks with me:

I consider "pretending to have sources and reasons" a worse sin than "not giving a source or reason"

I notice that one of the things that tips me off that a scientist is good is if her/his work demonstrates curiosity. Do they seem like they're actually trying to figure out the answer? Do they think through and address counterarguments, or just try to obscure those counterarguments?

This seems related: a person who puts no source might still be sharing their actual belief, but a person who puts a fake source seems like they're trying to sound legitimate.

Replies from: pktechgirl
comment by Elizabeth (pktechgirl) · 2019-11-10T19:57:17.430Z · LW(p) · GW(p)

Yes, this seems like a good guideline, although I can't immediately formalize how I detect curiosity. Vague list of things this made me think of:

  • I think this is a better guideline for books than scientific articles, which are heavily constrained by academic social and funding norms.
  • One good sign is if *I* feel curious in a concrete way when I read the book. What I mean by concrete is...
    • e.g. Fate of Rome had a ton of very specific claims about how climate worked and how historical climate conditions could be known. I spent a lot of time trying to verify these and even though I ultimately found them insufficiently supported, there was a concreteness that I still give positive marks for.
    • In contrast, for my most recently written epistemic spot check (not yet published), I spent a long time on several claims along the lines of "Pre-industrial Britain had a more favorable legal climate for entrepreneurship than continental Europe". I don't recall the author giving any specifics on what he meant by "more favorable", nor how he determined it was true. Investigating felt like a slog because I wasn't even sure what I was looking for.
      • I worry I'm being unfair here because maybe if I'd found lots of other useful sources I'd be rating the original book better. But when I investigated I found there wasn't even a consensus on whether Britain had a strong or weak patent system.
  • Moralizing around conclusions tends to inhibit genuine curiosity in me, although it can loop around to spite curiosity (e.g., Carol Dweck).
answer by Kristin Lindquist · 2019-11-02T20:23:40.409Z · LW(p) · GW(p)

Already many good answers, but I want to reinforce some and add others.

1. Beware of multiplicity - does the experiment include a large number of hypotheses, explicitly or implicitly? Implicit hypotheses include "Does the intervention have an effect on subjects with attributes A, B or C?" (subgroups) and "Does the intervention have an effect that is shown by measuring X, Y or Z?" (multiple endpoints). If multiple hypotheses were tested, were the results for each diligently reported? Note that multiplicity can be sneaky and you're often looking for what was left unsaid, such as a lack of plausible mechanism for the reported effect.

For example, take the experimental result "Male subjects who regularly consume Vitamin B in a non-multi-vitamin form have a greater risk of developing lung cancer (irrespective of dose)." Did they *intentionally* hypothesize that vitamin B would increase the likelihood of cancer, but only if 1) it was not consumed as part of a multi-vitamin and 2) in a manner that was not dose-dependent? Unlikely! The real conclusion of this study should have been "Vitamin B consumption does not appear correlated to lung cancer risk. Some specific subgroups did appear to have a heightened risk, but this may be a statistical anomaly."
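
To see how strong this effect is, here is a toy simulation (my addition): a study with no real effect anywhere, but with several subgroups and endpoints implicitly available to test.

```python
# Multiplicity: with 6 subgroups x 4 endpoints and no true effect at all,
# most simulated "studies" still find at least one significant result.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subgroups, n_endpoints, n_per_cell = 6, 4, 50
n_simulated_studies = 1000

studies_with_a_finding = 0
for _ in range(n_simulated_studies):
    hits = 0
    for _ in range(n_subgroups * n_endpoints):   # 24 implicit hypotheses
        a = rng.normal(size=n_per_cell)          # "intervention" group (no real effect)
        b = rng.normal(size=n_per_cell)          # control group
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    if hits > 0:
        studies_with_a_finding += 1

print(studies_with_a_finding / n_simulated_studies)   # ~0.7 (i.e. 1 - 0.95**24)
```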

2. Beware of small effect sizes and look for clinical significance - does the reported effect sound like something that matters? Consider the endpoint (e.g. change in symptoms of depression, as measured by the Hamilton Depression Rating Scale) and the effect size (e.g. d = 0.3, which is generally interpreted as a small effect). As a depressive person, I don't really care about a drug that has a small effect size.* I don't care if the effect is real but small or not real at all, because I'm not going to bother with that intervention. The "should I care" question cuts through a lot of the bullshit, binary thinking and the difficulty in interpreting small effect sizes (given their noisiness).

3. Beware of large effect sizes - lots of underpowered studies + publication bias = lots of inflated effect sizes reported. Andrew Gelman's "Type M" (magnitude) errors are a good way to look at this - an estimate of how inflated the effect size is likely to be. However, this isn't too helpful unless you're ready to bust out R when reading research. Alternately, a good rule of thumb is to be skeptical of 1) large effect sizes reported from small N studies and 2) confidence intervals wide enough to drive a truck through.
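
If you'd rather not bust out R, a rough Python analogue of Gelman and Carlin's "retrodesign" calculation looks like this (my sketch; the true effect and standard error are assumptions you supply):

```python
# Type M error: among estimates that happen to reach significance, how inflated
# are they on average, given an assumed true effect and the study's standard error?
import numpy as np
from scipy import stats

def type_m_error(true_effect: float, se: float, alpha: float = 0.05, sims: int = 100_000) -> float:
    rng = np.random.default_rng(3)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    estimates = rng.normal(true_effect, se, sims)   # sampling distribution of the estimate
    significant = np.abs(estimates) > z_crit * se   # replications that reach p < alpha
    return np.mean(np.abs(estimates[significant])) / true_effect

# Example: a noisy study whose standard error equals the assumed true effect.
print(type_m_error(true_effect=0.2, se=0.2))        # ~2.5x exaggeration on average
```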

4. Beware of low prior odds - is this finding in a highly exploratory field of research, and itself rather extraordinary? IMO this is an under-considered conclusion of Ioannidis' famous "Why Most Published Research Findings are False" paper. This Shinyapp nicely illustrates "positive predictive value" (PPV), which takes into account bias & prior odds.
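
The PPV calculation itself is simple enough to sanity-check by hand; a minimal version (my sketch, ignoring the bias term in Ioannidis' fuller model) is:

```python
# Positive predictive value: P(finding is true | it came out significant),
# given the prior odds R that hypotheses in this field are true, power, and alpha.
def ppv(prior_odds: float, power: float = 0.8, alpha: float = 0.05) -> float:
    return power * prior_odds / (power * prior_odds + alpha)

print(ppv(prior_odds=1.0))               # well-powered test of a 50/50 hypothesis: ~0.94
print(ppv(prior_odds=0.05, power=0.2))   # exploratory field, underpowered study: ~0.17
```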

5. Consider study design - obviously look for placebo control, randomization, blinding etc. But also look for repeated measures designs, e.g. "crossover" designs. Crossover designs achieve far higher power with fewer participants. If you're eyeballing study power, keep this in mind.

6. Avoid inconsistent skepticism - for one, don't be too skeptical of research just because of its funding source. All researchers are biased. It's small potatoes $$-wise compared to a Pfizer, but postdoc Bob's career/identity is on the line if he doesn't publish. Pfizer may have $3 billion on the line for their Phase III clinical trial, but if Bob can't make a name for himself, he's lost a decade of his life and his career prospects. Then take Professor Susan, who built her career on Effect X being real - what were those last 30 years for, if Effect X was just an anomaly?

Instead, look at 1) the quality of the study design, 2) the quality and transparency of the reporting (including COI disclosures, preregistrations, the detail and organization in said preregistrations, etc).

7. Learn to love meta-analysis - Where possible, look at meta-analyses rather than individual studies. But beware: meta-analyses can suffer their own design flaws, leading to some people saying "lies, damn lies and meta-analysis." Cochrane is the gold standard. If they have a meta-analysis for the question at hand, you're in luck. Also, check out the GRADE criteria - a pragmatic framework for evaluating the quality of research used by Cochrane and others.


*unless there is high heterogeneity in the effect amongst a subgroup with whom I share attributes, which is why subgrouping is both hazardous and yet still important.

comment by waveman · 2019-12-03T03:33:16.617Z · LW(p) · GW(p)

On bias see here https://www.bmj.com/content/335/7631/1202 and references. There is a lot of research about this. Note also that you do not even need to bias a particular researcher, just fund the researchers producing the answers you like, or pursuing the avenues you are interested in e.g. Coke's sponsorship of exercise research which produces papers suggesting that perhaps exercise is the answer.

One should not simply dismiss a study because of sponsorship, but be aware of what might be going on behind the scenes. And also be aware that people are oblivious to the effect that sponsorship has on them. One study of primary care doctors found a large effect on prescribing from free courses, dinners, etc, but the doctors adamantly denied any impact.

The suggestions of things to look for are valid and useful but often you just don't know what actually happened.

answer by romeostevensit · 2019-10-30T01:58:00.743Z · LW(p) · GW(p)

Sample size is related to how big an effect size you should be surprised by, i.e. power. Big effect sizes in smaller populations = less surprising. Why is there no overall rule of thumb? Because it gets modified a bunch by the base rate of what you're looking at and some other stuff I'm not remembering off the top of my head.
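
A quick sketch of that point (my addition): under the null, small samples routinely produce large apparent effect sizes, so a big d from a small study should surprise you less than the same d from a large one.

```python
# Apparent |Cohen's d| under the null, for small vs large samples.
import numpy as np

rng = np.random.default_rng(4)

def apparent_effect_sizes(n_per_group: int, sims: int = 5000) -> np.ndarray:
    a = rng.normal(size=(sims, n_per_group))
    b = rng.normal(size=(sims, n_per_group))
    pooled_sd = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
    return np.abs(a.mean(axis=1) - b.mean(axis=1)) / pooled_sd

for n in (10, 200):
    d = apparent_effect_sizes(n)
    print(n, round(float(np.quantile(d, 0.95)), 2))   # ~0.9 for n=10 vs ~0.2 for n=200
```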

In general I'd say there's enough methodological diversity that there's a lot of stuff I'm looking for as flags that a study wasn't designed well. For examples of such you can look at the inclusion criteria for meta-analyses.

There's also more qualitative things about how much I'm extrapolating based on the discussion section by the study authors. In the longevity posts for example, I laud a study for having a discussion section where the authors explicitly spend a great deal of time talking about what sorts of things are *not* reasonable to conclude from the study even though they might be suggestive for further research directions.

Confounds are kinda like building a key word map. I'm looking at the most well regarded studies in a domain, noting down what they're controlling for, then discounting studies that aren't controlling for them to varying degrees. This is another place where qualitative judgements creep in even in cochrane reviews where they are forced to just develop ad hoc 'tiers' of evidence (like A, B, C etc) and give some guidelines for doing so.

I have higher skepticism in general than I did years ago as I have learned about the number of ways that effects can sneak into the data despite honest intention by moderately competent scientists. I'm also much more aware of a fundamental problem with selection effects in that anyone running a study has some vested interest in framing hypotheses in various ways because nobody devotes themselves to something about which they're completely disinterested. This shows up as a problem in your own evaluation in that it's almost impossible to not sneak in isolated demands for rigor based on priors.

I'm also generally reading over the shoulder of whichever other study reviewers seem to be doing a good job in a domain. Epistemics is a team sport. An example of this is when Scott did a roundup of evidence for low carb diets, mentioning lots of other people doing meta reviews and speculating about why different conclusions were reached, e.g. Luke Muelhauser and I came down on the side that the VLC evidence seemed weak and Will Eden came down on the side that it seemed more robust, seemingly differing on how much weight we placed on inside view metabolic models vs outside view long term studies.

That's a hot take. It can be hard to just dump top level heuristics vs seeing what comes up from more specific questions/discussion.

answer by Charlie Steiner · 2019-10-30T09:03:36.807Z · LW(p) · GW(p)

Here's an answer for condensed matter physics:

Step 1: Read the title, journal name, author list, and affiliations.

By reading papers in a field, talking to people in the field, and generally keeping track of the field as a social enterprise, you should be able to place papers in a context even before reading them. People absolutely have reputations, and that should inform your priors. You should also have an understanding of what the typical research methods are to answer a certain question - check either the title or the abstract to make sure that the methods used match the problem.

Actually, you know what?

Step 0: Spend years reading papers and keeping track of people to develop an understanding of trust and reputation as various results either pan out or don't. Read a few textbooks to understand the physical basis of the commonly-used experimental and theoretical techniques, then check that understanding by reading more papers and keeping track of what kind of data quality is the standard in the field, how techniques are best applied, and which techniques and methods of analysis provide the most reliable results.

For example, by combining steps 0 and 1, you can understand that certain experimental techniques might be more difficult and easier to fool yourself with, but might be the best method available for answering some specific question. If you see a paper applying this technique to this sort of question, this actually should increase your confidence in the paper relative to the base rate for this technique, because it shows that the authors are exercising good judgment. Next...

Step 2: Read the abstract and look at the figures.

This is good for understanding the paper too, not just evaluating trustworthiness. Look for data quality (remember that you learned how to judge the data quality of the most common techniques in step 0) and whether they've presented it in a way that clearly backs up the core claims of the abstract, or presents the information you're trying to learn from the paper. Data that is merely suggestive of the authors' claims is actually a red flag, because remember, everyone just presents the nicest figure they can. Responsible scientists reduce their claims when the evidence is weak.

Step 3: Read the paper.

If you have specific parts you know you care about, you can usually just read those in detail and skim the rest. But if you really care about assessing this particular paper, check the procedures and compare it to your knowledge of how this sort of work should go. If there are specific parts that you want to check yourself, and you can do so, do so. This is also useful so you can...

Step 4: Compare it to similar papers.

You should have background knowledge, but it's also useful to keep similar papers (both in terms of what methods they used, and what problem they studied) directly on hand if you want to check something. If you know a paper that did a similar thing, use that to check their methods. Find some papers on the same problem and cross-check how they present the details of the problem and the plausibility of various answers, to get a feel for the consensus. Speaking of consensus, if there are two similar papers from way in the past that you found via Google Scholar and one of them has 10x the citations of the other, take that into account. When you notice confusing statements, you can check those similar papers to see how they handled it. But once you're really getting into the details, you'll have to...

Step 5: Follow up citations for things you don't understand or want to check.

If someone is using a confusing method or explanation, there should be a nearby citation. If not, that's a red flag. Find the citation and check whether it supports the claim in the original paper (recursing if necessary). Accept that this will require lots of work and thinking, but hey, at least this feeds back into step 0 so you don't have to do it as much next time.

Step 6: Ask a friend.

There are smart people out there. Hopefully you know some, so that if something seems surprising and difficult to understand, you can ask them what they think about it.

comment by Elizabeth (pktechgirl) · 2019-11-10T04:18:43.513Z · LW(p) · GW(p)
if there are two similar papers from way in the past that you found via Google Scholar and one of them has 10x the citations of the other, take that into account.

This seems great for figuring out the consensus in a field, but not for identifying when the consensus is wrong.

answer by whales · 2019-10-30T16:47:29.878Z · LW(p) · GW(p)

Recapitulating something I've written about before:

You should first make a serious effort to formulate both the specific question you want answered, and why you want an answer. It may turn out surprisingly often that you don't need to do all this work to evaluate the study.

Short of becoming an expert yourself, your best bet is then to learn how to talk to people in the field until you can understand what they think about the paper and why—and also how they think and talk about these things. This is roughly what Harry Collins calls "interactional" expertise. (He takes gravitational-wave scientist Joe Weber's late work as an especially vivid example: "I can promise such lay readers that if they teach themselves a bit of elementary statistics and persevere with reading the paper, they will find it utterly convincing. Scientific papers are written to be utterly convincing; over the centuries their special language and style has been developed to make them read convincingly.... The only way to know that Weber’s paper is not to be read in the way it is written is to be a member of the ‘oral culture’ of the relevant specialist community." The full passage is very good.)

If you only learn from papers (or even textbooks and papers), you won't have any idea what you're missing. A lot of expertise is bound up in individual tacit knowledge and group dynamics that never get written down. This isn't to say that the 'oral culture' is always right, but if you don't have a good grasp of it, you will make at best slow progress as an outsider.

This is the main thing holding me back from running the course I've half-written on layperson evaluation of science. Most of the time, the best thing is just to talk to people. (Cold emails are OK; be polite, concise, and ask a specific question. Grad students tend to be generous with their time if you have an interesting question or pizza and beer. And I'm glad to answer physics questions by LW message.)

Short of talking to people, you can often find blogs in the field of interest. More rarely, you can also find good journalism doing the above kind of work for you. (Quanta is typically good in physics, enough so that I more or less trust them on other subjects.)

There's plenty to be said about primary source evaluation, which varies with field and which the other answers so far get at, but I think this lesson needs to come first.

answer by Elizabeth · 2019-11-12T19:45:04.222Z · LW(p) · GW(p)

If a psychology study doesn't prominently say who its subjects were, the answer is "undergrads at the university, predominantly those in psychology classes" and it is worthless.

comment by habryka (habryka4) · 2019-11-12T21:08:53.616Z · LW(p) · GW(p)

I mean, lots of phenomena are likely to still be present in undergraduate psychology students, so it seems weird to say that the results are going to be worthless. Seems to me like how much you expect results to generalize from that population to others depends on the domain.

answer by TheMajor · 2019-10-30T11:49:11.687Z · LW(p) · GW(p)

Already partially mentioned by others, including OP.

I usually start with comparing the conclusion with my expectations (I'm painfully aware that this creates a confirmation bias, but what else am I supposed to compare it with?). If they are sufficiently different I try to imagine how, using the method described by the authors, I would be able to get a positive result to their experiment conditional on my priors being true, i.e. their conclusion being false. This is basically the same as trying to figure out how I would run the experiment and which data would disprove my assumptions, and then seeing if the published results fall in that category.

Usually the buck stops there; most published research uses methods that are sufficiently flimsy that (again, conditional on my priors) it is very likely the result was a fluke. This approach is pretty much the same as your third bullet point, and also waveman's point number 5. I would like to stress though that it's almost never enough to have a checklist of "common flaws in method sections" (although again, you have to start somewhere). Unfortunately different strengths and types of results in different fields require different methods.

A small Bayesian twist on the interpretation of this approach: when you're handed a paper (that doesn't match your expectations), that is evidence of something. I'm specifically looking at the chance that, conditional on my priors being accurate, the paper I'm given would still have been published.

answer by Mary Chernyshenko · 2021-01-09T18:58:40.660Z · LW(p) · GW(p)

A thing I pay attention to is how seamlessly the introduction leads to the question the authors picked. If there is a jarring break I try to imagine what kind of intro the paper needs, and then it's usually clear if I do understand enough to think about it. For example, if the actual intro is written from the ecological angle but the imagined one is zoological, and I really need to understand it, I should just ask a zoologist.

12 comments

Comments sorted by top scores.

comment by Ruby · 2019-10-29T18:46:21.121Z · LW(p) · GW(p)

Edit: Awards for the best responses + reviews of answers HERE [LW(p) · GW(p)]. [LW · GW]

I think this is a really important question and I'm eager to see answers. I'm willing to put up $100 of my personal money as a prize for what I think is the best answer and another $50 for what I think is the best summary of multiple answers. (I'm willing to combine these if the best answer includes a summary of other answers.)

This isn't a proper impact certificate, but if it were, I might be awarding this prize at 5% or less of the true value of the impact. So in offering $100, I'm saying the real impact could be worth like $2000 or more in my mind if it's a really good answer.

Assuming Eli is okay with this, I'll evaluate in two weeks, ending November 13 at 12:00 AM, and pledge to award within three weeks (for each day I'm late, I'll increase the prize amounts by 5% compounding).

A thing I would be interested in here is also people mentioning how they gained their ability to assess papers, e.g. "I can do this because I have a statistics degree" or "I can do this because of my domain expertise" and further bonus points on listing resources people could use to enhance their ability to assess research.

Replies from: elityre
comment by Eli Tyre (elityre) · 2019-11-01T22:10:29.326Z · LW(p) · GW(p)
Assuming Eli is okay with this

This sounds cool to me!

comment by Ruby · 2019-12-02T23:51:38.504Z · LW(p) · GW(p)

Awards for the Best Answers

When this question was posted a month ago, I liked it so much that I offered $100 of my own money [LW(p) · GW(p)] for what I judged to be the best answer and another $50 to the best distillation. Here's what I think:

Overall prize for best answer ($100): Unnamed [LW(p) · GW(p)] 

Additional prizes ($25): waveman [LW(p) · GW(p)], Bucky [LW(p) · GW(p)]

I will reach out to these authors via DM to arrange payment.

No one attempted what seemed to me like a proper distillation of other responses, so I won't be awarding the distillation prize here; however, I intend to write and publish my own distillation/synthesis of the responses soon.

Some thoughts on each of the replies:

Unnamed [LW(p) · GW(p)] [winner]: This answer felt very thorough and detailed, and it feels like it's a guide I could really follow to dramatically improve my ability to assess studies. I'm assuming limitations of LW's current editor meant the formatting couldn't be nicer, but I also really like how Unnamed broke down his overall response into three main questions ("Is this just noise?", "Is there anything interesting going on here?" and "What is going on here?") and then presented further sub-questions and examples to help one assess the high-level questions.

I'd like to better summarize Unnamed's response, but you should really just read it all.

waveman [LW(p) · GW(p)] [winner]: waveman's reply hits a solid amount of breadth in how to assess studies. I feel like his response is an easy guide I could pin up on my wall and easily step through while reading papers. What I would really like to see is this response further fleshed out with examples and resources, e.g. "read these specific papers or books on how studies get rigged." I'll note that I do have some pause with this response since other responders contradicted at least one part of it, e.g., Kristin Lindquist saying not to worry about the funding source of a study. I'd like to see these (perhaps only surface-level) disagreements resolved. Overall though, a really solid answer that deserves its karma.

Bucky [LW(p) · GW(p)] [winner]: Bucky's answer is deliciously technical. Rather than discussing high-level qualitative considerations to pay attention to (e.g. funding source, whether there have been replications), Bucky dives in and provides actual formulas and guidance about sample sizes, effect sizes, etc. What's more, Bucky discusses how he applied this approach to concrete studies (80k's replication quiz) and the outcome. I love the detail of the reply and it being backed up by concrete usage. I will mention that Bucky opens by saying that he uses subconscious thresholds in his assessments but is interested in discussing the levels other people use.

I do suspect that learning to apply the kinds of calculations Bucky points at is tricky and vulnerable to mistaken application. Probably a longer resource/more training is needed to be able to apply Bucky's approach successfully, but his answer at the least sets one on the right path.

Kristin Lindquist [LW(p) · GW(p)]: Kristin's answer is really very solid but feels like it falls short of the leading responses in terms of depth and guidance and doesn't add too much, though I do appreciate the links that were included. It's a pretty good summary. Also one of the best formatted of all answers given. I would like to see waveman and Kristin reach agreement on the question of looking at funding sources.

jimrandomh [LW(p) · GW(p)]: Jim's answer was short but added important answers to the conversation that no one else had stated. I think his suggestion of ensuring you ask yourself about how you ended up reading a particular study is excellent and crucial. I'm also intrigued by his response that controlling for confounds is much, much harder than people typically think. I'd very much like to see a longer essay demonstrating this.

Elizabeth [LW(p) · GW(p)]: I feel like this answer solidly reminds me think to about core epistemological questions when reading a study, e.g., "how do they know this?"

Romeostevensit [LW(p) · GW(p)]: this answer added a few more things to look for not included in other responses, e.g. giving more credit to authors who discuss what can't be concluded from their study. Also I like his mentioning that spurious effects can sneak into the data despite the honest intentions of moderately competent scientists. My experience with data analysis supports this. I'd like to see a discussion between romeostevensit and jimrandomh since they both seem to have thoughts about confounds (and I further know they both have an interest in nutrition research).

Charlie Steiner [LW(p) · GW(p)]: Good additional detail in this one, e.g. the instruction to compare papers to other similar papers and general encouragement to get a sense of what methods are reasonable. This is a good answer, just not as good as the very top answers. Would like to see some concrete examples to learn from with this one. I appreciate the clarification that this response is for Condensed Matter Physics. I'd be curious to see how other researchers feel it generalizes to their domains.

whales [LW(p) · GW(p)]: Good advice and they could be right that a lot of key knowledge is tacit (in the oral tradition) and not included in papers or textbooks. That seems like something well worth remembering. I'd be rather keen to see whales's course on layperson evaluation of science.

The Major [LW(p) · GW(p)]: Response seems congruent with other answers but is much shorter and less detailed than them.

Replies from: Ruby, Bucky
comment by Ruby · 2019-12-03T19:37:06.453Z · LW(p) · GW(p)

It would be good to know if offering prizes like this is helpful in producing counterfactually more and better responses. So, to all those who responded with the great answers, I have a question:

How did the offer of a prize influence your contribution? Did it make any difference? If so, how come?

comment by Bucky · 2019-12-03T13:52:40.337Z · LW(p) · GW(p)

Thanks Ruby.

Good summary of my answer; by the time I got round to writing mine there were so many good qualitative summaries I wanted to do something different. I think you’ve hit the nail on the head with the main weakness being difficulty in application, particularly in estimating Cohen’s d.

I am currently taking part in replication markets and basing my judgements mainly on experimental power. Hopefully this will give me a better idea of what works and I may write an updated guide next year.

As a data point r.e. the prize, I’m pretty sure that if the prize wasn’t there I would have done my usual and intended to write something and never actually got round to it. I think this kind of prize is particularly useful for questions which take a while to work on and attention would otherwise drift.

Replies from: Ruby
comment by Ruby · 2019-12-03T19:22:39.387Z · LW(p) · GW(p)

Hopefully this will give me a better idea of what works and I may write an updated guide next year.

I'd be excited to see that.

As a data point r.e. the prize, I’m pretty sure that if the prize wasn’t there I would have done my usual and intended to write something and never actually got round to it. I think this kind of prize is particularly useful for questions which take a while to work on and attention would otherwise drift.

Oh, that's helpful to know, and reminds me that I intended to ask respondents how the offer of a prize affected their contributions.

comment by Stefan_Schubert · 2019-10-30T10:17:31.680Z · LW(p) · GW(p)

A recent paper developed a statistical model for predicting whether papers would replicate.

We have derived an automated, data-driven method for predicting replicability of experiments. The method uses machine learning to discover which features of studies predict the strength of actual replications. Even with our fairly small data set, the model can forecast replication results with substantial accuracy — around 70%. Predictive accuracy is sensitive to the variables that are used, in interesting ways. The statistical features (p-value and effect size) of the original experiment are the most predictive. However, the accuracy of the model is also increased by variables such as the nature of the finding (an interaction, compared to a main effect), number of authors, paper length and the lack of performance incentives. All those variables are associated with a reduction in the predicted chance of replicability.
...
The first result is that one variable that is predictive of poor replicability is whether central tests describe interactions between variables or (single-variable) main effects. Only eight of 41 interaction effect studies replicated, while 48 of the 90 other studies did.

Another, unrelated, thing is that authors often make inflated interpretations of their studies (in the abstract, the general discussion section, etc). Whereas there is a lot of criticism of p-hacking and other related practices pertaining to the studies themselves, there is less scrutiny of how authors interpret their results (in part that's understandable, since what counts as a dodgy interpretation is more subjective). Hence when you read the methods and results sections it's good to think about whether you'd make the same high-level interpretation of the results as the authors.

comment by Ben Pace (Benito) · 2019-12-02T04:39:53.200Z · LW(p) · GW(p)

This question has loads of great answers, with people sharing their hard-earned insights about how to engage with modern scientific papers and make sure to get the truth out of them, so I curated it.

Replies from: Ruby
comment by Ruby · 2019-12-03T14:40:50.146Z · LW(p) · GW(p)

Forgive me if I rant a little against this curation notice.

hard-earned insights about what's bad about science and how to get truth out of it anyway.

I'm not sure I'd frame people's responses quite this way, i.e., I think that's framing people as having a very negative valence towards current science in a way I'm not sure is there and I would be reluctant to assign to them. Or maybe more importantly, I don't think that captures the prompt being replied to. If I'd authored a response here, I'd dislike this notice for somehow trying to make my response "political" in a way I don't endorse, like it's taking the opportunity for a "boo science" that wasn't the point for me.

Conservatively, I read people's responses as being built on the basis that studies vary in trustworthiness and answers are about methods for assessing trustworthiness/strength of evidence. Answers are about how scientific studies can be done poorly, but aren't a response to the prompt of "what are ways in which science is bad?"

Sorry, I'm probably reading too much into the wording of a single sentence. Charitably, I could read the notice as saying the answers given contain ways in which scientific studies can be bad and how to filter those ones out (or trust them to that appropriate extent).

 

Replies from: Benito
comment by Ben Pace (Benito) · 2019-12-05T03:39:39.987Z · LW(p) · GW(p)

Yeah, I think you're right. Edited.

comment by habryka (habryka4) · 2020-12-14T03:34:34.625Z · LW(p) · GW(p)

The answers to this question were great, and I've referred multiple people to this post when they asked me how to go about reading papers. This also became particularly relevant in 2020 when quickly assessing the quality of papers had a surprisingly large effect on what immediate quarantine and social distancing actions to take.

comment by Ruby · 2020-12-14T03:13:52.946Z · LW(p) · GW(p)

I feel like the answers to this question are a fantastic resource. Commentary on them would be valuable, but really I'd like to see them enshrined as wisdom for the scholars.

I may have a conflict of interest here since I put up a bounty/prize for answers to this question - and I wasn't disappointed.

(At the same time, I do recommend Paper Reading for Gears, which this is not.)