comment by MichaelA · score: 7 (5 votes)
Ways of describing the “trustworthiness” of probabilities
While doing research for a post on the idea of a distinction between “risk” and “(Knightian) uncertainty”, I came across a surprisingly large number of different ways of describing the idea that some probabilities may be more or less “reliable”, “trustworthy”, “well-grounded”, etc. than others, or things along those lines. (Note that I’m referring to the idea of different degrees of trustworthiness-or-whatever, rather than two or more fundamentally different types of probability that vary in trustworthiness-or-whatever.)
I realised that it might be valuable to write a post collecting all of these terms/concepts/framings together, analysing the extent to which some may be identical to others, highlighting ways in which they may differ, suggesting ways or contexts in which some of the concepts may be superior to others, etc. But there are already too many things I’m working on writing at the moment, so this is a low-effort version of that idea: basically just a collection of the concepts, relevant quotes, and links where readers can find more.
Comments on this post will inform whether I take the time to write something more substantial/useful on this topic later (and, if so, precisely what and how).
Note that this post does not explicitly cover the “risk vs uncertainty” framing itself, as I’m already writing a separate, more thorough post on that.
Epistemic credentials
Dominic Roser speaks of how “high” or “low” the epistemic credentials of our probabilities are. He writes:
The expression ‘‘epistemic credentials of probabilities’’ is a shorthand for two things: First, it refers to the credentials of the epistemic access to the probabilities: Are our beliefs about the probabilities well-grounded? Second—and this applies only to the case of subjective probabilities—it refers to the credentials of the probabilities themselves: Are our subjective probabilities—i.e. our degrees of belief—well-grounded?
He further explains what he means by this in a passage that also alludes to many other ways of describing or framing an idea along the lines of the trustworthiness of given probabilities:
What does it mean for probabilities to have (sufficiently) high epistemic credentials? It could for example mean that we can calculate or reliably estimate the probabilities (Keynes 1937, p. 214; Gardiner 2010, p. 7; Shue 2010, p. 148) rather than just guesstimate them; it could mean that our epistemic access allows for unique, numerical or precise probabilities (Kelsey and Quiggin 1992, p. 135; Friedman 1976, p. 282; Kuhn 1997, p. 56) rather than for qualitative and vague characterizations of probabilities or for ranges of probabilities; or it could mean that our epistemic access allows for knowledge of probabilities, in particular for knowledge that is certain, or which goes beyond the threshold of being extremely insecure, or which is not only based on a partial theory that is only valid ceteris paribus (Hansson 2009, p. 426; Rawls 1999, p. 134; Elster 1983, p. 202).
These examples from the literature provide different ways of spelling out the idea that our epistemic situation with regard to the probabilities must be of sufficient quality before we can properly claim to have probabilities. I will not focus on any single one of those ways. I am only concerned with the fact that they are all distinct from the idea that the mere existence of probabilities and mere epistemic access, however minimal, is sufficient. This second and narrower way of understanding ‘‘having probabilities’’ seems quite common for distinguishing risk from uncertainty. For example, in his discussion of uncertainty, Gardiner (2006, p. 34), based on Rawls (1999, p. 134), speaks of lacking, or having reason to sharply discount, information about probabilities, Peterson (2009, p. 6) speaks of it being virtually impossible to assign probabilities, and Bognar (2011, p. 331) says that precautionary measures are warranted whenever the conditions that Rawls described are approximated. This indicates that in order to distinguish risk from uncertainty, these authors do not examine whether we have probabilities at all, but rather whether we have high-credentials probabilities rather than low-credentials probabilities.
Note also that some believe that scientific progress can turn contexts of uncertainty into contexts of risk. For example, in the third assessment report the IPCC gave a temperature range but it did not indicate the probability of staying within this range. In the fourth assessment report, probabilities were added to the range. If one believes that scientific progress can move us from uncertainty to risk, this indicates as well that one’s risk-uncertainty distinction is not about the sheer availability of probabilities, i.e. having probabilities simpliciter. Given the gradual progress of science, it would be surprising if, after some time, probabilities suddenly became available at all. It seems more plausible that probabilities which were available all along changed from having hardly any credentials (in which case we might call them hunches) to gradually having more credentials. And when they cross some threshold of credentials, then—so I interpret parts of the literature—we switch from uncertainty to risk and we can properly claim to have probabilities. (line breaks added)
Resilience (of credences)
Amanda Askell discusses the idea that we can have “more” or “less” resilient credences in this talk and this book chapter.
From the talk:
if I thought there was a 50% chance that I would get $100, there’s actually a difference between a low resilience 50% and a high resilience 50%.
I’m going to argue that, if your credences are low resilience, then the value of information in this domain is generally higher than it would be in a domain where your credences are high resilience. And, I’m going to argue that this means that actually in many cases, we should prefer interventions with less evidential support, all else being equal.
[...] One kind of simple formulation of resilience [...] is that credo-resilience is how stable you expect your credences to be in response to new evidence. If my credences are high resilience, then there’s more stability. I don’t expect them to vary that much as new evidence comes in, even if the evidence is good and pertinent to the question. If they’re low resilience, then they have low stability. I expect them to change a lot in response to new evidence. That’s true in the case of the untested coin, where I just have no data about how good it is, so the resilience of my credence of 50% is fairly low.
It’s worth noting that resilience levels can reflect either the set of evidence that you have about a proposition, or your prior about the proposition. So, if it’s just incredibly plausible that the coins are generally fair. For example, if you saw me simply pick the coin up out of a stack of otherwise fair coins, in this case you would have evidence that it’s fair. But if you simply live in a world that doesn’t include a lot of very biased coins, then your prior might be doing a lot of the work that your evidence would otherwise do. These are the two things that generate credo-resilience.
In both cases, with the coin, your credence that the coin will land heads on the next flip is the same, it’s 0.5. Your credence of 0.5 about the tested coin is resilient, because you’ve done a million trials of this coin. Whereas, your credence about the untested coin is quite fragile. It could easily move in response to new evidence, as we see here.
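To make the tested-versus-untested coin contrast concrete, here is a minimal sketch (an illustration, not something from Askell's talk) that models each credence as a Beta distribution. The uniform Beta(1, 1) starting prior, the function names, and the ten new flips are all assumptions chosen for the example.

```python
def beta_update(alpha, beta, heads, tails):
    """Conjugate Beta-Binomial update: add observed counts to the pseudo-counts."""
    return alpha + heads, beta + tails


def credence_heads(alpha, beta):
    """Mean of Beta(alpha, beta), i.e. the credence that the next flip lands heads."""
    return alpha / (alpha + beta)


# Untested coin: a uniform Beta(1, 1) prior (assumed for illustration).
untested = (1.0, 1.0)
# Tested coin: the same prior after a million flips, half of them heads.
tested = beta_update(1.0, 1.0, 500_000, 500_000)

# Both credences start at 0.5.
print(credence_heads(*untested), credence_heads(*tested))

# The same new evidence arrives for both coins: 10 flips, 8 heads.
untested_after = credence_heads(*beta_update(*untested, 8, 2))
tested_after = credence_heads(*beta_update(*tested, 8, 2))

# The fragile credence moves to 0.75; the resilient one barely moves.
print(untested_after, tested_after)
```

Under these assumptions the two credences are identical before the new flips, and only their resilience differs.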
Later in the talk, Askell highlights an implication of this idea, and how it differs from the idea of just not having precise probabilities at all:
A lot of people seem to be kind of unwilling to assert probability estimates about whether something is going to work or not. I think a really good explanation for this is that, in cases where we don’t have a lot of evidence, our credences about how good our credences are, are fairly low.
We basically think it’s really likely that we’re going to move around a lot in response to new evidence. We’re just not willing to assert a credence that we think is just going to be false, or inaccurate once we gain a little bit more evidence. Sometimes people think you have mushy credences, that you don’t actually have precise probabilities that you can assign to claims like, “This intervention is effective to Degree N.” I actually think resilience might be a good way of explaining that away, to say, “No. You can have really precise estimates. You just aren’t willing to assert them.”
(This comment thread seems to me to suggest that the term “robustness of credences” may mean the same thing as “resilience of credences”, but I’m not sure about that.)
Evidential weight (balance vs weight of evidence)
In the book chapter linked to above, Askell also discusses the idea of evidential weight (or the idea of the weight of the evidence, as opposed to the balance of evidence). This seems quite similar to the idea of credence resilience.
The balance of the evidence refers to how decisively the evidence supports the proposition. The weight of the evidence is the total amount of relevant evidence that we have.
Since I can’t easily copy and paste from that chapter, for further info see pages 39-41 of that chapter (available in the preview I linked to).
Probability distributions (and confidence intervals)
Policy choice can still do justice to precautionary intuitions even when making use of probabilities. I submit that what drives precautionary intuitions is that in cases where there is little and unreliable evidence, our subjective probability distributions should exhibit a larger spread around the best guess. These spread out probability distributions yield precautionary policy-making when they are combined with, for example, the general idea of diminishing marginal utility (Stern 2007, p. 38) or the idea that an equal probability of infringing our descendants’ rights and bequeathing more to them than we owe them does not cancel each other out (Roser and Seidel 2017, p. 82). (emphasis added)
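The claim in the passage above, that a wider spread around the same best guess favours precaution once diminishing marginal utility is added, can be checked with a small worked example. This is only a sketch: the logarithmic utility function, the outcome values, and the probabilities are assumptions picked for illustration, not figures from the quoted source.

```python
import math


def expected_value(outcomes, probs):
    """Probability-weighted average of the outcomes."""
    return sum(p * x for x, p in zip(outcomes, probs))


def expected_utility(outcomes, probs, utility=math.log):
    """Probability-weighted average utility; log utility has diminishing marginal utility."""
    return sum(p * utility(x) for x, p in zip(outcomes, probs))


probs = [0.25, 0.5, 0.25]
narrow = [90.0, 100.0, 110.0]  # reliable evidence: small spread around the best guess
wide = [20.0, 100.0, 180.0]    # little, unreliable evidence: large spread, same best guess

for name, outcomes in [("narrow", narrow), ("wide", wide)]:
    ev = expected_value(outcomes, probs)
    eu = expected_utility(outcomes, probs)
    # The certainty equivalent is the sure outcome that is as good as the gamble.
    print(name, ev, round(math.exp(eu), 2))
```

Both distributions have the same expected outcome, but the wide one has a noticeably lower certainty equivalent, so a decision-maker with this utility function would give up more to avoid it.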
And Nate Soares writes:
we are bounded reasoners, and we usually can't consider all available hypotheses. [...]
Bounded Bayesian reasoners should expect that they don't have access to the full hypothesis space. Bounded Bayesian reasoners can expect that their first-order predictions are incorrect due to a want of the right hypothesis, and thus place high credence on "something I haven't thought of", and place high value on new information or other actions that expand their hypothesis space. Bounded Bayesians can even expect that their credence for an event will change wildly as new information comes in.
[...] if I expect that I have absolutely no idea what the black swans will look like but also have no reason to believe black swans will make this event any more or less likely, then even though I won't adjust my credence further, I can still increase the variance of my distribution over my future credence for this event.
In other words, even if my current credence is 50% I can still expect that in 35 years (after encountering a black swan or two) my credence will be very different. This has the effect of making me act uncertain about my current credence, allowing me to say "my credence for this is 50%" without much confidence. So long as I can't predict the direction of the update, this is consistent Bayesian reasoning. (emphasis added)
A good, quick explanation, accompanied by diagrams, can be found in this comment.
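Soares's point that one can widen the distribution over one's future credence without changing the current credence can be illustrated with exact arithmetic. The sketch below assumes a Beta(a, b) credence about a coin and a single further observation; the function name and the two example priors are hypothetical.

```python
from fractions import Fraction


def future_credence_stats(a, b):
    """Mean and variance of the credence held after one more flip, given Beta(a, b) now."""
    p_heads = Fraction(a, a + b)            # current credence in heads
    if_heads = Fraction(a + 1, a + b + 1)   # credence after seeing heads
    if_tails = Fraction(a, a + b + 1)       # credence after seeing tails
    mean = p_heads * if_heads + (1 - p_heads) * if_tails
    variance = (p_heads * (if_heads - mean) ** 2
                + (1 - p_heads) * (if_tails - mean) ** 2)
    return mean, variance


fragile = future_credence_stats(1, 1)      # little evidence behind the 50%
resilient = future_credence_stats(50, 50)  # much more evidence behind the 50%

print(fragile)    # expected future credence is still 1/2, with variance 1/36
print(resilient)  # expected future credence is still 1/2, with a far smaller variance
```

In both cases the expected future credence equals the current one, so the direction of the update cannot be predicted; only the variance differs.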
Precision, sharpness, vagueness
(These ideas seem closely related to the ideas of probability distributions and confidence intervals [above], and to the concept of haziness [below].)
In this paper, Adam Elga writes:
Sometimes one’s evidence for a proposition is sharp. For example: You’ve tossed a biased coin thousands of times. 83% of the tosses landed heads, and no pattern has appeared even though you’ve done a battery of statistical tests. Then it is clear that your confidence that the next toss will land heads should be very close to 83%.
Sometimes one’s evidence for a proposition is sparse but with a clear upshot. For example: You have very little evidence as to whether the number of humans born in 1984 was even. But it is clear that you should be very near to 50% confident in this claim.
But sometimes one’s evidence for a proposition is sparse and unspecific. For example: A stranger approaches you on the street and starts pulling out objects from a bag. The first three objects he pulls out are a regular-sized tube of toothpaste, a live jellyfish, and a travel-sized tube of toothpaste. To what degree should you believe that the next object he pulls out will be another tube of toothpaste?
It is very natural in such cases to say: You shouldn’t have any very precise degree of confidence in the claim that the next object will be toothpaste. It is very natural to say: Your degree of belief should be indeterminate or vague or interval-valued. On this way of thinking, an appropriate response to this evidence would be a degree of confidence represented not by a single number, but rather by a range of numbers. The idea is that your probability that the next object is toothpaste should not equal 54%, 91%, 18%, or any other particular number. Instead it should span an interval of values, such as [10%, 80%]. (emphasis added)
Elga then quotes various authors making claims along those lines, and writes:
These authors all agree that one’s evidence can make it downright unreasonable to have sharp degrees of belief. The evidence itself may call for unsharp degrees of belief, and this has nothing to do with computational or representational limitations of the believer. Let me write down a very cautious version of this claim:
UNSHARP: It is consistent with perfect rationality that one have unsharp degrees of belief.
However, Elga spends the rest of the paper arguing against this claim, and arguing instead (based on a type of Dutch book argument) for the following claim:
SHARP: Perfect rationality requires one to have sharp degrees of belief.
(Elga’s arguments seem sound to me, but I think they still allow for representing our beliefs as probability distributions that do have some mean or central or whatever value, and then using that value in many of the contexts Elga talks about. Thus, in those contexts, we’d act as if we have a “sharp degree of belief”, but we could still be guided by the shape and width of our probability distributions when thinking about things like how valuable additional information would be. But I’m not an expert on these topics, and haven’t thought about this stuff in depth.)
See also the Wikipedia article on imprecise probability.
Haziness
Chris Smith (I believe that’s their name, based on this post) writes:
Consider a handful of statements that involve probabilities:
A hypothetical fair coin tossed in a fair manner has a 50% chance of coming up heads.
When two buddies at a bar flip a coin to decide who buys the next round, each person has a 50% chance of winning.
Experts believe there’s a 20% chance the cost of a gallon of gasoline will be higher than $3.00 by this time next year.
Dr. Paulson thinks there’s an 80% chance that Moore’s Law will continue to hold over the next 5 years.
Dr. Johnson thinks there’s a 20% chance quantum computers will commonly be used to solve everyday problems by 2100.
Kyle is an atheist. When asked what odds he places on the possibility that an all-powerful god exists, he says “2%.”
I’d argue that the degree to which probability is a useful tool for understanding uncertainty declines as you descend the list.
The first statement is tautological. When I describe something as “fair,” I mean that it perfectly conforms to abstract probability theory.
In the early statements, the probability estimates can be informed by past experiences with similar situations and explanatory theories.
In the final statement, I don’t know what to make of the probability estimate.
The hypothetical atheist from the final statement, Kyle, wouldn’t be able to draw on past experiences with different realities (i.e., Kyle didn’t previously experience a bunch of realities and learn that some of them had all-powerful gods while others didn’t). If you push someone like Kyle to explain why they chose 2% rather than 4% or 0.5%, you almost certainly won’t get a clear explanation.
If you gave the same “What probability do you place on the existence of an all-powerful god?” question to a number of self-proclaimed atheists, you’d probably get a wide range of answers.
I bet you’d find that some people would give answers like 10%, others 1%, and others 0.001%. While these probabilities can all be described as “low,” they differ by orders of magnitude. If probabilities like these are used alongside probabilistic decision models, they could have extremely different implications. Going forward, I’m going to call probability estimates like these “hazy probabilities.”
Placing hazy probabilities on the same footing as better-grounded probabilities (e.g., the odds a coin comes up heads) can lead to problems. (bolding added)
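As a rough illustration of how much the choice among such hazy estimates matters in a probabilistic decision model, consider a simple expected-value calculation. The payoff and cost figures below are hypothetical and are not taken from Smith's post.

```python
def expected_value(probability, payoff, cost):
    """Expected net value of paying `cost` for a `probability` chance of `payoff`."""
    return probability * payoff - cost


# Hypothetical stakes, chosen only for illustration.
payoff, cost = 1000.0, 5.0

# Three "low" probability estimates that differ by orders of magnitude.
estimates = {"10%": 0.10, "1%": 0.01, "0.001%": 0.00001}
values = {label: expected_value(p, payoff, cost) for label, p in estimates.items()}

for label, value in values.items():
    print(label, round(value, 2))
```

With these made-up stakes, the first estimate makes the action look clearly worthwhile, the second makes it marginal, and the third makes it a net loss.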
Hyperpriors, credal sets, and other things I haven't really learned about
Bayesian approaches to probability treat it as a degree of belief and thus they do not draw a distinction between risk and a wider concept of uncertainty: they deny the existence of Knightian uncertainty. They would model uncertain probabilities with hierarchical models, i.e. where the uncertain probabilities are modelled as distributions whose parameters are themselves drawn from a higher-level distribution (hyperpriors). (emphasis added)
I haven’t looked into this and don’t properly understand it, so I won’t say more about it here, but I think it’s relevant. (This also might be related to the idea of confidence intervals mentioned earlier; as stated at the top, this is a low-effort version of this post where I’m not really trying to explain how the different framings might overlap or differ.)
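For readers who want a feel for what a hierarchical model looks like, here is a minimal simulation sketch. It is a two-level simplification in which the uncertain probability of heads is drawn from a Beta distribution whose parameters are fixed by hand; a full hierarchical model would in turn place a hyperprior on those parameters. All names and numbers are assumptions for illustration.

```python
import random
import statistics


def simulate_head_counts(a, b, flips, trials, rng):
    """Two-level model: draw p ~ Beta(a, b), then toss `flips` coins with P(heads) = p."""
    counts = []
    for _ in range(trials):
        p = rng.betavariate(a, b)  # higher level: the uncertain probability
        counts.append(sum(rng.random() < p for _ in range(flips)))  # lower level: flips
    return counts


rng = random.Random(0)  # fixed seed so the run is repeatable
diffuse = simulate_head_counts(1, 1, flips=10, trials=20_000, rng=rng)
concentrated = simulate_head_counts(100, 100, flips=10, trials=20_000, rng=rng)

for name, counts in [("diffuse", diffuse), ("concentrated", concentrated)]:
    print(name, round(statistics.mean(counts), 2), round(statistics.pvariance(counts), 2))
```

Both settings predict about five heads in ten flips on average, but the diffuse higher-level distribution yields a much wider spread of outcomes, which is one way of representing a less trustworthy 50%.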
The ideas of a credal set and of robust Bayesian analysis also seem relevant, but I have extremely limited knowledge on those topics.
I hope you found this somewhat useful. As stated earlier, comments on this post will inform whether I take the time to write something more substantial/useful on this topic later (and, if so, precisely what and how).
Also, if you know of another term/concept/framing that’s relevant, please add a comment mentioning it, to expand the collection here.
comment by mr-hire · score: 3 (2 votes)
I like this and would find a post moderately valuable. I think sometimes posts with a lot of synonyms are hard to have take aways from, because it's hard to remember all the synonyms. What I think is useful is comparing and contrasting the different takes, creating a richer view of the whole framework by examining it from many angles.
Re Knightian Uncertainty vs. Risk, I wrote a post that discusses the interaction of different types of risks (including Knightian) here: https://www.lesswrong.com/posts/eA9a5fpi6vAmyyp74/how-to-understand-and-mitigate-risk
comment by MichaelA · score: 1 (1 votes)
Thanks for the feedback!
I think sometimes posts with a lot of synonyms are hard to have take aways from, because it's hard to remember all the synonyms. What I think is useful is comparing and contrasting the different takes, creating a richer view of the whole framework by examining it from many angles.
Yeah, I'd agree with that, and it's part of why fleshing this out is currently low priority for me (since the latter approach takes actual work!), but remains theoretically on the list :)
comment by JesseClifton · score: 2 (2 votes)
There are "reliabilist" accounts of what makes a credence justified. These accounts differ, but they say (very roughly) that a credence is justified if it is produced by a process whose outputs are close to the truth on average. See this paper: https://philpapers.org/rec/PETWIJ-2
Frequentist statistics can be seen as a version of reliabilism. Criteria like the Brier score for evaluating forecasters can also be understood in a reliabilist framework.
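For reference, the Brier score is the mean squared difference between probabilistic forecasts and the 0/1 outcomes, with lower scores being better. A minimal sketch follows; the two forecasters and the outcomes are made up for illustration.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes (0 or 1)."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


outcomes = [1, 0, 1, 1, 0]               # what actually happened (hypothetical)
informed = [0.9, 0.2, 0.8, 0.7, 0.1]     # a forecaster who tracks the truth well
always_half = [0.5, 0.5, 0.5, 0.5, 0.5]  # a forecaster who always says 50%

print(round(brier_score(informed, outcomes), 3))  # 0.038
print(brier_score(always_half, outcomes))         # 0.25
```

On a reliabilist reading, the score evaluates the forecasting process by its average closeness to the truth, not any single credence taken alone.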
comment by MichaelA · score: 1 (1 votes)
Psychology: An Imperfect and Improving Science
This is an essay I wrote in 2017 as coursework for the final year of my Psychology undergrad degree. (That was a year before I learned about EA and the rationalist movement.)
I’m posting this as a shortform comment, rather than as a full post, because it’s now a little outdated, it’s just one of many things that people have written on this topic, and I don’t think the topic is of central interest to a massive portion of LessWrong readers. But I do think it holds up well, is pretty clear, and makes some points that generalise decently beyond psychology (e.g., about drawing boundaries between science and pseudoscience, evaluating research fields, and good research practice).
I put the references in a “reply” to this.
Psychology's scientific status has been denied or questioned by some (e.g., Berezow, 2012; Campbell, 2012). Evaluating such critiques and their rebuttals requires defining “science”, considering what counts as psychology, and exploring how unscientific elements within a field influence the scientific standing of that field as a whole. This essay presents a conception of “science” that consolidates features commonly seen as important into a family resemblance model. Using this model, I argue psychology is indeed a science, despite unscientific individuals, papers, and practices within it. However, these unscientific practices make psychology less scientific than it could be. Thus, I outline their nature and effects, and how psychologists are correcting these issues.
Addressing whether psychology is a science requires specifying what is meant by “science”. This is more difficult than some writers seem to recognise. For example, Berezow (2012) states we can “definitively” say psychology is non-science “[b]ecause psychology often does not meet the five basic requirements for a field to be considered scientifically rigorous: clearly defined terminology, quantifiability, highly controlled experimental conditions, reproducibility and, finally, predictability and testability.” However, there are fields that do not meet those criteria whose scientific status is generally unquestioned. For example, astronomy and earthquake science do not utilise experiments (Irzik & Nola, 2014). Furthermore, Berezow leaves unmentioned other features associated with science, such as data-collection and inference-making (Irzik & Nola, 2011). Many such features have been noted by various writers, though some are contested by others or only present or logical in certain sciences. For example, direct observation of the matters of interest has been rightly noted as helping make fields scientific, as it reduces issues like the gap between self-reported intentions and the behaviours researchers seek to predict (Godin, Conner, & Sheeran, 2005; Rhodes & de Bruijn, 2013; Sheeran, 2002; Skinner, 1987). However, self-reported intentions are still useful predictors of behaviour and levers for manipulating it (Godin et al., 2005; Rhodes & de Bruijn, 2013; Sheeran, 2002), and science often productively investigates constructs such as gravity that are not directly observable (Bringmann & Eronen, 2016; Chomsky, 1971; Fanelli, 2010; Michell, 2013). Thus, definitions of science would benefit from noting the value of direct observation, but cannot exclude indirect measures or unobservable constructs. This highlights the difficulty – or perhaps impossibility – of defining science by way of a list of necessary and sufficient conditions for scientific status (Mahner, 2013).
An attractive solution is instead constructing a family resemblance model of science (Dagher & Erduran, 2016; Irzik & Nola, 2011, 2014; Pigliucci, 2013). Family resemblance models are sets of features shared by many but not all examples of something. To demonstrate, three characteristics common in science are experiments, double-blind trials, and the hypothetico-deductive method (Irzik & Nola, 2014). A definition of science omitting these would be missing something important. However, calling these “necessary” excludes many sciences; for example, particle physics would be rendered unscientific for lack of double-blind trials (Cleland & Brindell, 2013; Irzik & Nola, 2014). Thus, a family resemblance model of science only requires a field to have enough scientific features, rather than requiring the field to have all such features. The full list of features this model should include, the relative importance of each feature, and what number or combination is required for something to be a “science” could all be debated. However, for showing that psychology is a science, it will suffice to provide a rough family resemblance model incorporating some particularly important features, which I shall now outline.
Firstly, Berezow's (2012) “requirements”, while not actually necessary for scientific status, do belong in a family resemblance model of science. That is, when these features can be achieved, they make a field more scientific. The importance of reproducibility is highlighted also by Kahneman (2014) and Klein et al. (2014a, 2014b), and that of testability or falsifiability is also mentioned by Popper (1957) and Ferguson and Heene (2012). These features are related to the more fundamental idea that science should be empirical; claims should be required to be supported by evidence (Irzik & Nola, 2011; Pigliucci, 2013). Together, these features allow science to be self-correcting, incrementally progressing towards truth by accumulation of evidence and peer-review of ideas and findings (Open Science Collaboration, 2015). This is further supported by scientists' methods and results being made public and transparent (Anderson, Martinson, & De Vries, 2007, 2010; Nosek et al., 2015; Stricker, 1997). Additionally, findings and predictions should logically cohere with established theories, including those from other sciences (Lilienfeld, 2011; Mahner, 2013). These features all support science's ultimate aims to benefit humanity by explaining, predicting, and controlling phenomena (Hansson, 2013; Irzik & Nola, 2014; Skinner, cited in Delprato & Midgley, 1992). Each feature may not be necessary for scientific status, and many other features could be added, but the point is that each feature a field possesses makes that field more scientific. Thus, armed with this model, we are nearly ready to productively evaluate the scientific status of psychology.
However, two further questions must first be addressed: What is psychology, and how do unscientific occurrences within psychology affect the scientific status of the field as a whole? For example, it can generally be argued parapsychology is not truly part of psychology, for reasons such as its lack of support from mainstream psychologists. However, there are certain more challenging instances, such as the case of a paper by Bem (2011) claiming to find evidence for precognition. This used accepted methodological and analytical techniques, was published in a leading psychology journal, and was written by a prominent, mainstream psychologist. Thus, one must accept that this paper is, to a substantial extent, part of psychology. It therefore appears important to determine whether Bem's paper exemplifies science. It certainly has many scientific features, such as use of experiments and evidence. However, it lacks other features, such as logical coherence with the established principle of causation only proceeding forwards in time.
But it is unnecessary here to determine whether the paper is non-science, insufficiently scientific, or bad science, because, regardless, this episode shows psychology as a field being scientific. This is because scientific features such as self-correction and reproducibility are most applicable to a field as a whole, rather than to an individual scientist or article, and these features are visible in psychology's response to Bem's (2011) paper. Replication attempts were produced and supported the null hypothesis; namely, that precognition does not occur (Galak, LeBoeuf, Nelson, & Simmons, 2012; Ritchie, Wiseman, & French, 2012; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). Furthermore, publicity, peer-review, and self-correction of findings and ideas were apparent in those failed replications and in commentary on Bem's paper (Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; Francis, 2012; LeBel & Peters, 2011). Peers discussed many issues with Bem's article, such as several variables having been recorded by Bem's experimental program yet not mentioned in the study (Galak et al., 2012; Ritchie et al., 2012), suggesting that the positive results reported may have been false positives emerging by chance from many, mostly unreported analyses. Wagenmakers et al. (2011) similarly noted other irregularities and unexplained choices in data transformation and analysis, and highlighted that Bem had previously recommended to psychologists: “If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. […] Go on a fishing expedition for something—anything—interesting” (Bem, cited in Wagenmakers et al., 2011). These responses to Bem’s study by psychologists highlight that, while the scientific status of that study is highly questionable, isolated events such as that need not overly affect the scientific status of the entire field of psychology.
Indeed, psychology's response to Bem's (2011) paper exemplifies ways in which the field in general fits the family resemblance model of science outlined earlier. This model captures how different parts of psychology can each be scientific, despite showing different combinations of scientific features. For example, behaviourists may use more direct observation and clearly defined terminology (see Delprato & Midgley, 1992; Skinner, 1987), while evolutionary psychologists better integrate their theories and findings with established theories from other sciences (see Burke, 2014; Confer et al., 2010). These features make subfields that have them more scientific, but lacking one feature does not make a subfield non-science. Similarly, while much of psychology utilises controlled experiments, those parts that do not, like longitudinal studies of the etiology of mental disorders, can still be scientific if they have enough other scientific features, such as accumulation of evidence to increase our capacity for prediction and intervention.
Meanwhile, other scientific features are essentially universal in psychology. For example, all psychological claims and theories are expected to be based on or confirmed by evidence, and are rejected or modified if found not to be. Additionally, psychological methods and findings are made public by publication, with papers being peer-reviewed before this and open to critique afterwards, facilitating self-correction. Such self-correction can be seen in the response to Bem's (2011) paper, as well as in how most psychological researchers now reject the untestable ideas of early psychoanalysis (see Cioffi, 2013; Pigliucci, 2013). Parts of psychology vary in their emphasis on basic versus applied research; for example, some psychologists investigate the processes underlying sadness while others conduct trials of specific cognitive therapy techniques for depression. However, these various branches can support each other, and all psychological research ultimately pursues benefitting humanity by explaining, predicting, and controlling phenomena. Indeed, while there is much work to be done and precision is rarely achieved, psychology can already make predictions much more accurate than chance or intuition in many areas, and thus provides benefits as diverse as anxiety-reduction via exposure therapy and HIV-prevention via soap operas informed by social-cognitive theories (Bandura, 2002; Lilienfeld, Ritschel, Lynn, Cautin, & Latzman, 2013; Zimbardo, 2004). All considered, most of psychology exemplifies most important scientific features, and thus psychology should certainly be considered a science.
However, psychology is not as scientific as it could be. Earlier I noted that isolated papers reporting inaccurate findings and utilising unscientific practices, as Bem (2011) seems highly likely to have, should not significantly affect psychology's scientific status, as long as the field self-corrects adequately. However, as several commentators on Bem's paper noted, more worrying is what that paper reflects regarding psychology more broadly, given that it largely met or exceeded psychology's methodological, analytical, and reporting standards (Francis, 2012; LeBel & Peters, 2011; Wagenmakers et al., 2011). The fact Bem met these standards, yet still “discovered” and got published results that seem to violate fundamental principles about how causation works, highlights the potential prevalence of spurious findings in psychological literature. These findings could result from various flaws and biases, yet might fail to be recognised or countered in the way Bem's report was if they are not as clearly false; indeed, they may be entirely plausible, yet inaccurate (LeBel & Peters, 2011). Thus, I will now discuss how critiques regarding Bem's paper apply to much of mainstream psychology.
Firstly, the kind of “fishing expedition” recommended by Bem (cited in Wagenmakers et al., 2011) is common in psychology. Researchers often record many variables, and have flexibility in which variables, interactions, participants, data transformations, and statistics they use in their analyses (John, Loewenstein, & Prelec, 2012). Wagenmakers et al. (2012) note that such practices are not inherently problematic, and indeed such explorations are useful for suggesting hypotheses to test in a confirmatory manner. The issue is that often these explorations are inadequately reported and are presented as confirmatory themselves, despite the increased risk of false positives when conducting multiple comparisons (Asendorpf et al., 2013; Wagenmakers et al., 2012). Neuropsychological studies can be particularly affected by failures to control for multiple comparisons, even if all analyses are reported, because analysis of brain activity makes huge numbers of comparisons the norm. Thus, without statistical controls, false positives are almost guaranteed (Bennett, Baird, Miller, & Wolford, 2009). The issue of uncontrolled multiple comparisons, whether reported or not, causing false positives can be compounded by hindsight bias making results seem plausible and predictable in retrospect (Wagenmakers et al., 2012). This can cause overconfidence in findings and make researchers feel comfortable writing articles as if these findings were hypothesised beforehand (Kerr, 1998). These practices inflate the number of false discoveries and spurious confirmations of theories in psychological literature.
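The claim that uncorrected multiple comparisons make false positives "almost guaranteed" can be illustrated with a small calculation. This is only a sketch under simplifying assumptions that are not from the sources cited above: every null hypothesis is true, the tests are independent, and each uses a significance threshold of .05. The function names are hypothetical, and the Bonferroni correction is shown as one common statistical control rather than the specific correction any cited author recommends.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes (for illustration only) that all null hypotheses are true
    and that the tests are independent of one another.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the familywise rate at or below `alpha`."""
    return alpha / num_tests


if __name__ == "__main__":
    for num_tests in (1, 10, 100, 1000):
        uncorrected = familywise_error_rate(num_tests)
        corrected = familywise_error_rate(num_tests, bonferroni_alpha(num_tests))
        print(f"{num_tests:>5} tests: uncorrected {uncorrected:.3f}, "
              f"Bonferroni-corrected {corrected:.3f}")
```

Under these assumptions, ten uncorrected comparisons already carry roughly a 40% chance of at least one spurious "finding", and with the thousands of comparisons typical of brain imaging the chance is effectively 100%.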
This is compounded by publication bias. Journals are more likely to publish novel and positive results than replications or negative results (Ferguson & Heene, 2012; Francis, 2012; Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Kerr, 1998). One reason for this is that, despite the importance of self-correction and incremental progress, replications or negative results are often treated as not showing anything substantially interesting (Klein et al., 2014b). Another reason is the idea that null results are hard to interpret or overly likely to be false negatives (Ferguson & Heene, 2012; Kerr, 1998). Psychological studies regularly have insufficient power; their sample sizes mean that, even if an effect of the expected size does exist, the chance of not finding it is substantial (Asendorpf et al., 2013; Bakker, Hartgerink, Wicherts, & van der Maas, 2016). Further, the frequentist statistics typically used by psychologists cannot clearly quantify the support data provides for null hypotheses; these statistics have difficulty distinguishing between powerful evidence for no effect and simply a failure to find evidence for an effect (Dienes, 2011). While concerns about the interpretability of null results are thus often reasonable, they distort the psychological literature's representation of reality (see Fanelli, 2010; Kerr, 1998). Publication bias also takes the form of researchers being more likely to submit for publication those studies that revealed positive results (John et al., 2012). This can occur because researchers themselves also often find negative results difficult to interpret, and know they are less likely to be published or to lead to incentives like grants or prestige (Kerr, 1998; Open Science Collaboration, 2015). Thus, flexibility in analysis, failure to control for or report multiple comparisons, presentation of exploratory results as confirmatory, publication bias, low power, and difficulty interpreting null results are interrelated issues. These issues in turn make psychology less scientific by reducing the transparency of methods and findings.
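To make the point about insufficient power concrete, here is a rough sketch of a power calculation for a simple two-group comparison. It uses a normal approximation to a two-sided test, and the effect size (d = 0.5) and sample size (20 per group) are illustrative assumptions of mine, not figures taken from Asendorpf et al. (2013) or Bakker et al. (2016). The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def approx_power(effect_size_d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation: treats the test statistic as normally distributed
    with unit variance, which slightly overstates power for small samples.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region when the effect is real.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def approx_n_per_group(effect_size_d: float, target_power: float = 0.80,
                       alpha: float = 0.05) -> int:
    """Approximate participants needed per group to reach `target_power`."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(target_power)
    return ceil(2 * ((z_crit + z_power) / effect_size_d) ** 2)


if __name__ == "__main__":
    # Illustrative values only: a "medium" effect and a small study.
    power = approx_power(effect_size_d=0.5, n_per_group=20)
    print(f"Power with 20 per group: {power:.2f}")
    print(f"Chance of missing a real effect: {1 - power:.2f}")
    print(f"Per-group n for 80% power: {approx_n_per_group(0.5)}")
```

With these assumed values the study would detect a real effect only about a third of the time, which is the sense in which a null result from such a study is hard to interpret.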
These issues also undermine other scientific features. The Open Science Collaboration (2015) conducted replications of 100 studies from leading psychological journals, finding that less than half replicated successfully. This low level of reproducibility in itself makes psychology less scientific, and provides further evidence of the likely high prevalence and impact of the issues noted above (Asendorpf et al., 2013; Open Science Collaboration, 2015). Together, these problems impede self-correction, and make psychology's use of evidence and testability of theories less meaningful, as replications and negative tests are often unreported (Ferguson & Heene, 2012). This undermines psychology's ability to benefit humanity by explaining, predicting, and controlling phenomena.
However, while these issues make psychology less scientific, they do not make it non-science. Other sciences, including “hard sciences” like physics and biology, also suffer from issues like publication bias and low reproducibility and transparency (Alatalo, Mappes, & Elgar, 1997; Anderson, Burnham, Gould, & Cherry, 2001; McNutt, 2014; Miguel et al., 2014; Sarewitz, 2012; Service, 2002). Their presence is problematic and demands a response in any case, and may be more pronounced in psychology than in “harder” sciences, but it is not necessarily damning (see Fanelli, 2010). For example, the Open Science Collaboration (2015) did find a large portion of effects replicated, particularly effects whose initial evidence was stronger. Meanwhile, Klein et al. (2014a) found a much higher rate of replication for more established effects, compared to the Open Science Collaboration's quasi-random sample of recent findings. Both results highlight that, while psychology certainly has work to do to become more reliable, the field also has the capacity to scientifically progress towards truth and is already doing so to a meaningful extent.
Furthermore, psychologists themselves are highlighting these issues and researching and implementing solutions for them. Bakker et al. (2016) discuss the problem of low power and how to overcome it with larger sample sizes, reinforced by researchers habitually running power analyses prior to conducting studies and reviewers checking that these analyses have been conducted. Nosek et al. (2015) proposed guidelines for promoting transparency by changing what journals encourage or require, such as replications, better reporting and sharing of materials and data, and pre-registration of studies and analysis plans. Pre-registration side-steps confirmation and hindsight bias and unreported, uncorrected multiple comparisons, as expectations and analysis plans are on record before data is gathered (Wagenmakers et al., 2012). Journals can also conditionally accept studies for publication based on pre-registered plans, minimising bias against null results by both journals and researchers. Such proposals still welcome exploratory analyses, but prevent these analyses being presented as confirmatory (Miguel et al., 2014). Finally, psychologists have argued for, outlined how to use, and adopted Bayesian statistics as an alternative to frequentist statistics (Ecker, Lewandowsky, & Apai, 2011; Wagenmakers et al., 2011). Bayesian statistics provide clear quantification of evidence for null hypotheses, combatting one source of publication bias and making testability of psychological claims more meaningful (Dienes, 2011; Francis, 2012). These proposals are beginning to take effect. For example, many journals and organisations are signatories to Nosek et al.'s guidelines. Additionally, the Center for Open Science, led by the psychologist Brian Nosek, has set up online tools for researchers to routinely make their data, code, and pre-registered plans public (Miguel et al., 2014). This shows psychology self-correcting its practices, not just individual findings, to become more scientific.
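As a minimal illustration of how Bayesian statistics can quantify evidence for a null hypothesis, consider a study counting "hits" on a task where chance performance is 50%. The sketch below computes a Bayes factor comparing the null (hit rate exactly at chance) against an alternative in which any hit rate from 0 to 1 is equally plausible beforehand. That uniform prior, the example counts, and the function name are all assumptions chosen for simplicity; they are not the specific analyses used by Dienes (2011) or Wagenmakers et al. (2011).

```python
from math import comb


def bayes_factor_null(successes: int, trials: int, chance: float = 0.5) -> float:
    """Bayes factor BF01 for a binomial hit rate.

    H0: the true hit rate equals `chance`.
    H1: the true hit rate is unknown, with a uniform prior on [0, 1]
        (an illustrative assumption, not a recommended default).
    Values above 1 favour H0; values below 1 favour H1.
    """
    likelihood_null = (comb(trials, successes)
                       * chance ** successes
                       * (1 - chance) ** (trials - successes))
    # Under a uniform prior, every count from 0 to `trials` is equally likely,
    # so the marginal likelihood of the data under H1 is 1 / (trials + 1).
    marginal_alternative = 1 / (trials + 1)
    return likelihood_null / marginal_alternative


if __name__ == "__main__":
    # Hypothetical data: 52 hits in 100 trials versus 70 hits in 100 trials.
    print(f"BF01 for 52/100: {bayes_factor_null(52, 100):.2f}")
    print(f"BF01 for 70/100: {bayes_factor_null(70, 100):.4f}")
```

Under these assumptions, 52 hits in 100 trials yields a Bayes factor well above 1, meaning the data actively support the null rather than merely failing to reject it, whereas 70 hits in 100 yields a value far below 1. A frequentist p-value alone cannot draw the first of these distinctions.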
I have argued here that claims that psychology is non-scientific may often reflect unworkable definitions of science and ignorance of what psychology actually involves. A family resemblance model of science overcomes the former issue by outlining features that sciences do not have to possess to be science, but do become more scientific by possessing. This model suggests psychology is a science because it generally exemplifies most scientific features; most importantly, it accumulates evidence publicly, incrementally, and self-critically to benefit humanity by explaining, predicting, and controlling phenomena. However, psychology is not as scientific as it could be. A variety of interrelated issues with researchers' and journals' practices and incentive structures impede the effectiveness and meaningfulness of psychology's scientific features. But failure to be perfectly scientific is not unique to psychology; it is universal among sciences. Science has achieved what it has because of its constant commitment to incremental improvement and self-correction of its own practices. In keeping with this, psychologists are researching and discussing psychology's issues and their potential solutions, and such solutions are being put into action. More work must be done, and more researchers and journals must act on and push for these discussions and solutions, but already it is clear both that psychology is a science and that it is actively working to become more scientific.
comment by MichaelA · score: 1 (1 votes) · LW · GW
Alatalo, R. V., Mappes, J., & Elgar, M. A. (1997). Heritabilities and paradigm shifts. Nature, 385(6615), 402-403. doi:10.1038/385402a0
Anderson, D. R., Burnham, K. P., Gould, W. R., & Cherry, S. (2001). Concerns about finding effects that are actually spurious. Wildlife Society Bulletin, 29(1), 311-316.
Anderson, M. S., Martinson, B. C., & De Vries, R. (2007). Normative dissonance in science: Results from a national survey of U.S. scientists. Journal of Empirical Research on Human Research Ethics: An International Journal, 2(4), 3-14. doi:10.1525/jer.2007.2.4.3
Anderson, M. S., Ronning, E. A., De Vries, R., & Martinson, B. C. (2010). Extending the Mertonian norms: Scientists' subscription to norms of research. The Journal of Higher Education, 81(3), 366-393. doi:10.1353/jhe.0.0095
Asendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J., Fiedler, K., … Wicherts, J. M. (2013). Recommendations for increasing replicability in psychology. European Journal of Personality, 27(2), 108-119. doi:10.1002/per.1919
Bakker, M., Hartgerink, C. H., Wicherts, J. M., & van der Maas, H. L. J. (2016). Researchers' intuitions about power in psychological research. Psychological Science, 27(8), 1069-1077. doi:10.1177/0956797616647519
Bandura, A. (2002). Environmental sustainability by sociocognitive deceleration of population growth. In P. Schmuck & W. P. Schultz (Eds.), Psychology of sustainable development (pp. 209-238). New York, NY: Springer.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425. doi:10.1037/a0021524
Bennett, C. M., Baird, A. A., Miller, M. B., & Wolford, G. L. (2009). Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Neuroimage, 47(Suppl 1), S125. doi:10.1016/s1053-8119(09)71202-9
Berezow, A. B. (2012, July 13). Why psychology isn't science. Los Angeles Times. Retrieved from http://latimes.com
Bringmann, L. F., & Eronen, M. I. (2015). Heating up the measurement debate: What psychologists can learn from the history of physics. Theory & Psychology, 26(1), 27-43. doi:10.1177/0959354315617253
Burke, D. (2014). Why isn't everyone an evolutionary psychologist? Frontiers in Psychology, 5. doi:10.3389/fpsyg.2014.00910
Campbell, H. (2012, July 17). A biologist and a psychologist square off over the definition of science. Science 2.0. Retrieved from http://www.science20.com
Chomsky, N. (1971). The case against BF Skinner. The New York Review of Books, 17(11), 18-24.
Cleland, C. E., & Brindell, S. (2013). Science and the messy, uncontrollable world of nature. In M. Pigliucci & M. Boudry (Eds.), The philosophy of pseudoscience (pp. 183-202). Chicago, IL: University of Chicago Press.
Confer, J. C., Easton, J. A., Fleischman, D. S., Goetz, C. D., Lewis, D. M., Perilloux, C., & Buss, D. M. (2010). Evolutionary psychology: Controversies, questions, prospects, and limitations. American Psychologist, 65(2), 110-126. doi:10.1037/a0018413
Dagher, Z. R., & Erduran, S. (2016). Reconceptualizing nature of science for science education: Why does it matter? Science & Education, 25, 147-164. doi:10.1007/s11191-015-9800-8
Delprato, D. J., & Midgley, B. D. (1992). Some fundamentals of B. F. Skinner's behaviorism. American Psychologist, 47(11), 1507-1520. doi:10.1037//0003-066x.47.11.1507
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290. doi:10.1177/1745691611406920
Ecker, U. K., Lewandowsky, S., & Apai, J. (2011). Terrorists brought down the plane!—No, actually it was a technical fault: Processing corrections of emotive information. The Quarterly Journal of Experimental Psychology, 64(2), 283-310. doi:10.1080/17470218.2010.497927
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLoS ONE, 5(4). doi:10.1371/journal.pone.0010068
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychological science's aversion to the null. Perspectives on Psychological Science, 7(6), 555-561. doi:10.1177/1745691612459059
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19(2), 151-156. doi:10.3758/s13423-012-0227-9
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103(6), 933-948. doi:10.1037/a0029709
Godin, G., Conner, M., & Sheeran, P. (2005). Bridging the intention-behaviour gap: The role of moral norm. British Journal of Social Psychology, 44(4), 497-512. doi:10.1348/014466604x17452
Hansson, S. O. (2013). Defining pseudoscience and science. In M. Pigliucci & M. Boudry (Eds.), The philosophy of pseudoscience (pp. 61-77). Chicago, IL: University of Chicago Press.
Ioannidis, J. P., Munafò, M. R., Fusar-Poli, P., Nosek, B. A., & David, S. P. (2014). Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences, 18(5), 235-241. doi:10.1016/j.tics.2014.02.010
Irzik, G., & Nola, R. (2011). A family resemblance approach to the nature of science for science education. Science & Education, 20(7), 591-607. doi:10.1007/s11191-010-9293-4
Irzik, G., & Nola, R. (2014). New directions for nature of science research. In M. R. Matthews (Ed.), International Handbook of Research in History, Philosophy and Science Teaching (pp. 999-1021). Dordrecht: Springer.
John, L., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23(5), 524-532. doi:10.1177/0956797611430953
Kahneman, D. (2014). A new etiquette for replication. Social Psychology, 45(4), 310-311.
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196-217. doi:10.1207/s15327957pspr0203_4
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, S., Bernstein, M. J., Bocian, K., … Nosek, B. (2014a). Investigating variation in replicability: A “many labs” replication project. Social Psychology, 45(3), 142-152. doi:10.1027/a000001
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Bahník, S., Bernstein, M. J., Bocian, K., … Nosek, B. (2014b). Theory building through replication: Response to commentaries on the “many labs” replication project. Social Psychology, 45(4), 299-311. doi:10.1027/1864-9335/a000202
LeBel, E. P., & Peters, K. R. (2011). Fearing the future of empirical psychology: Bem's (2011) evidence of psi as a case study of deficiencies in modal research practice. Review of General Psychology, 15(4), 371-379. doi:10.1037/a0025172
Lilienfeld, S. O. (2011). Distinguishing scientific from pseudoscientific psychotherapies: Evaluating the role of theoretical plausibility, with a little help from Reverend Bayes. Clinical Psychology: Science and Practice, 18(2), 105-112. doi:10.1111/j.1468-2850.2011.01241.x
Lilienfeld, S. O., Ritschel, L. A., Lynn, S. J., Cautin, R. L., & Latzman, R. D. (2013). Why many clinical psychologists are resistant to evidence-based practice: Root causes and constructive remedies. Clinical Psychology Review, 33(7), 883-900. doi:10.1016/j.cpr.2012.09.008
Mahner, M. (2013). Science and pseudoscience: How to demarcate after the (alleged) demise of the demarcation problem. In M. Pigliucci & M. Boudry (Eds.), The philosophy of pseudoscience (pp. 29-43). Chicago, IL: University of Chicago Press.
McNutt, M. (2014). Reproducibility. Science, 343(6168), 229. doi:10.1126/science.1250475
Michell, J. (2013). Constructs, inferences, and mental measurement. New Ideas in Psychology, 31(1), 13-21. doi:10.1016/j.newideapsych.2011.02.004
Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K. M., Gerber, A., … Van der Laan, M. (2014). Promoting transparency in social science research. Science, 343(6166), 30-31. doi:10.1126/science.1245317
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Contestabile, M. (2015). Promoting an open research culture. Science, 348(6242), 1422-1425.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Pigliucci, M. (2013). The demarcation problem: A (belated) response to Laudan. In M. Pigliucci & M. Boudry (Eds.), The philosophy of pseudoscience (pp. 9-28). Chicago, IL: University of Chicago Press.
Popper, K. (1957). Philosophy of science: A personal report. In C. A. Mace (Ed.), British Philosophy in Mid-Century (pp. 155-160). London: Allen and Unwin.
Rhodes, R. E., & de Bruijn, G.-J. (2013). How big is the physical activity intention-behaviour gap? A meta-analysis using the action control framework. British Journal of Health Psychology, 18(2), 296-309. doi:10.1111/bjhp.12032
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS ONE, 7(3), e33423. doi:10.1371/journal.pone.0033423
Sarewitz, D. (2012). Beware the creeping cracks of bias. Nature, 485(7397), 149.
Service, R. F. (2002). Scientific misconduct: Bell Labs fires star physicist found guilty of forging data. Science, 298(5591), 30-31. doi:10.1126/science.298.5591.30
Sheeran, P. (2002). Intention—behavior relations: A conceptual and empirical review. European Review of Social Psychology, 12(1), 1-36. doi:10.1080/14792772143000003
Skinner, B. F. (1987). Whatever happened to psychology as the science of behavior? American Psychologist, 42(8), 780-786. doi:10.1037/0003-066x.42.8.780
Stricker, G. (1997). Are science and practice commensurable? American Psychologist, 52(4), 442-448. doi:10.1037//0003-066x.52.4.442
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426-432. doi:10.1037/a0022790
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L. J., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632-638. doi:10.1177/1745691612463078
Zimbardo, P. G. (2012). Does psychology make a significant difference in our lives? In Applied Psychology (pp. 39-64). Psychology Press.