Do Antidepressants work? (First Take)
post by Jacob Goldsmith (jacgoldsm) · 2025-01-12T17:11:55.417Z · LW · GW · 1 commentsContents
How is depression measured? What does it mean for antidepressants to work? What do these studies find? What does BCH find? How do you reconcile this? Simulation Conclusion References None 1 comment
I've been researching the controversy over whether antidepressants truly work or whether they are not superior to a placebo. The latter possibility really contains two possibilities itself: either placebos are effective at treating depression, or the placebo effect reflects mean reversion. Here, the term "antidepressant" refers to drugs classified as SSRIs and SNRIs.
Some stylized facts from the literature of RCTs:
- Both antidepressants and placebos are associated with a reduction in symptoms of depression
- Antidepressants are associated with a larger reduction in depressive symptoms than placebos
- This difference in efficacy is attenuated when "enhanced" placebos are used, that is placebos that are designed to provoke side effects to persuade the subject that they are being treated with a real medication
- Studies on antidepressants broadly measure either an average treatment effect according to some continuous scale or a binary treatment according to a cutoff for success/failure of treatment (or both). Universally, if you looked only at the former measure you would conclude that the effect of antidepressants on depression is clinically insignificant, while if you looked only at the latter measure you would come to a more favorable conclusion.
How is depression measured?
Most studies of antidepressant efficacy use the Hamilton Depression Rating Scale (HDRS) to measure changes in depressive symptoms from the baseline, although entrance into the clinical trials themselves is often not based on this measure. There are several versions of this scale, but the most popular one is HDRS-17, which is a seventeen question survey administered by a clinician. Scores on HDRS-17 vary from zero to 57. Scores of 0-7 are generally considered normal, while a score of 20 or greater indicates at least moderate depression.
You can look at the questions on HDRS-17 here. To my eye, it seems to strangely underweight feelings of subjective psychological distress that define depression, and includes a lot of symptoms that are only contingently associated with what we actually mean when we talk about depression. I think the heterogeneity in types of questions is a potential problem in using it to quantitatively assess the effectiveness of interventions on depressive disorder, but I'd like to leave that aside for now for the most part.
What does it mean for antidepressants to work?
Most studies that use a scale like HDRS-17 to look at the effectiveness of interventions on depression either define a cutoff of how much improvement counts as partial or full remission and use that cutoff as a binary indicator for success, or look at the average treatment effect in terms of points of improvement for the treatment arm relative the placebo arm. In the former case, a decrease of 50% of baseline HDRS-17 points or more is often used as the cutoff, corresponding to at least partial remission of depression. In the continuous treatment effect, an improvement of three points or more is often considered to be a clinically significant effect, although that might be too low (three point swings in HDRS would usually be considered noise for an individual patient in a clinical setting).
What do these studies find?
In this essay, I'm going to focus on one particular meta-analysis of the effectiveness of fluoxetine (Prozac) by Bech P, Cialdella P, Haugh MC, et al in BJP. However, I believe that this study is representative of most similar analyses, and my discussion of it is broadly applicable. Other studies are discussed in a review of meta-analyses in the Annals of Hospital Psychiatry and a critical review of the literature in BMJ. The pattern that studies with a quantitative endpoint found a small and clinically insignificant improvement seems to hold nearly universally as far as I can find.
Conversely, most studies that examine the efficacy of antidepressants in terms of a cutoff on percent improvement show a clinically significant effect. Sometimes, the same meta-analysis includes both endpoints. The Bech, Cialdella, Haugh et al (heretofore BCH) analysis that I want to look into more closely is one of them.
What does BCH find?
BCH is a meta-analysis of 30 studies, some of which compare fluoxetine to a placebo, and others that compare it to old Tricyclic antidepressants (TCAs). I want to focus on the comparisons to the placebo, of which there were 16, all conducted in the US. The authors consider three measures of treatment: Intention to treat, which counts any study dropouts as having failed the study, Efficacy analysis, which takes anyone who completed four weeks of treatment and uses their last check-in as the post-treatment result, and End-point, which takes anyone who had any post-treatment visit and uses their last check-in as their post-treatment result. The results are summarized in the following table from the paper:
I think the intent to treat measure is the best since antidepressants are not supposed to continue to work after discontinuation, so anyone who does not continue the treatment throughout the post-baseline period should be expected not to show any long-term improvement (though for some, short term relief could be a meaningful benefit). However, only the third measure (End-point) is used for the average treatment effect analysis. In every case, the results seem pretty clinically significant: to be a responder, you have to halve your HDRS-17 score relative to the baseline. If a treated subject is 13.6 or 21.4 or 16.8 percentage points more likely to be a responder by this definition than a recipient of the placebo, I think it is reasonable to call this a positive result for fluoxetine.
The average treatment effect analysis uses the End-point measure and simply takes the mean difference in HDRS-17 scores from before and after treatment and compares the difference in differences for the treatment arm and the survey arm. Unfortunately, it does not report the original averages nor the single-differenced results—it only displays the last difference-in-differences figure. The individual study results and the average are displayed below:
The average treatment effect across all studies was an improvement for the fluoxetine relative to the placebo of 0.303 HDRS-17 points. Recall that the HDRS-17 is out of 57 points, most of the participants probably started with a score of 20 or greater, and a score difference of three (ten times what the study shows) is not usually considered clinically significant. Clearly, this result is a pretty precisely estimated zero.
How do you reconcile this?
It is of course possible to get different results when you analyze a variable as continuous versus if you dichotomize it. Indeed, it's possible to make a dichotomous result look material even with a small change by choosing the right threshold. For example, if 50% of treated participants improved their score by two points and 50% saw no effect, while no placebo respondents had a change, then choosing a threshold of one point of improvement would show that the treatment group was 50 percentage points more likely to respond to treatment.
However, I don't think this is sufficient to explain the discrepancy in this case. The only way I can come up with numbers that approach what the study found is to assume an extremely heterogeneous treatment effect, to the point where the heterogeneity should be a first-order object of concern more than the average treatment effect itself.
Simulation
There are a lot of parameters to the study that aren't given, so I made them up. However, essentially all reasonable parameters give about the same result. Feel free to try to find some assumptions under which the difference in treatment versus control remission rates approaches 16.8 percentage points.
Note that the only thing that the study does state is that the mean difference in change between the pre and post period for treated versus placebo respondents was 0.303 HDRS-17 points. Here's some R code that shows my approach. In the following code, I assume a homogeneous treatment effect.
ENTRY_CUTOFF <- 20L
MAX_HDRS_SCORE <- 57L
AVERAGE_HDRS_SCORE <- 30L
MEAN_PLACEBO_CHANGE <- 8L
set.seed(100)
entrants <- rnorm(1000000L, mean=AVERAGE_HDRS_SCORE,sd=4)
entrants <- ifelse(entrants < ENTRY_CUTOFF, ENTRY_CUTOFF, entrants)
entrants <- ifelse(entrants > MAX_HDRS_SCORE, MAX_HDRS_SCORE, entrants)
treat <- entrants
pla <- entrants
treatment_effect <- rep_len(-0.303,1000000)
placebo_effect <- rnorm(1000000,mean=-1*MEAN_PLACEBO_CHANGE,sd=4)
treatment_diff <- treatment_effect + placebo_effect
post_treat <- treat + treatment_diff
post_pla <- pla + placebo_effect
post_treat <- ifelse(post_treat < 0, 0, post_treat)
post_pla <- ifelse(post_pla < 0, 0, post_pla)
diff_diff <- post_treat - post_pla
mean(diff_diff)
# partial remission: 50% decline
treatment_comp <- (post_treat - treat) / treat
treat_remission <- ifelse(treatment_comp < -0.5,1,0)
tm <- mean(treat_remission)
tm*100
pla_comp <- (post_pla - pla) / pla
pla_remission <- ifelse(pla_comp < -0.5,1,0)
tp <- mean(pla_remission)
tp*100
print((tm - tp)*100)
# 0.8337%
With these parameters, I find that treated participants should be 0.84 percentage points more likely to be in remission than placebo respondents.
In the following code, I allow for a heterogeneous treatment effect. To get to something approaching the results of the study, I have to let the treatment effect itself (recall, its mean is 0.303) to be normally distributed with a massive standard deviation of 7.75.
ENTRY_CUTOFF <- 20L
MAX_HDRS_SCORE <- 57L
AVERAGE_HDRS_SCORE <- 30L
MEAN_PLACEBO_CHANGE <- 8L
set.seed(100)
entrants <- rnorm(1000000L, mean=AVERAGE_HDRS_SCORE,sd=4)
entrants <- ifelse(entrants < ENTRY_CUTOFF, ENTRY_CUTOFF, entrants)
entrants <- ifelse(entrants > MAX_HDRS_SCORE, MAX_HDRS_SCORE, entrants)
treat <- entrants
pla <- entrants
treatment_effect <- rnorm(1000000,mean=-0.303, sd=7.75)
placebo_effect <- rnorm(1000000,mean=-1*MEAN_PLACEBO_CHANGE,sd=4)
treatment_diff <- treatment_effect + placebo_effect
post_treat <- treat + treatment_diff
post_pla <- pla + placebo_effect
post_treat <- ifelse(post_treat < 0, 0, post_treat)
post_pla <- ifelse(post_pla < 0, 0, post_pla)
diff_diff <- post_treat - post_pla
mean(diff_diff)
# partial remission: 50% decline
treatment_comp <- (post_treat - treat) / treat
treat_remission <- ifelse(treatment_comp < -0.5,1,0)
tm <- mean(treat_remission)
tm*100
pla_comp <- (post_pla - pla) / pla
pla_remission <- ifelse(pla_comp < -0.5,1,0)
tp <- mean(pla_remission)
tp*100
print((tm - tp)*100)
# 16.8374%
The implications of this are very strange. Under such a heterogeneous treatment effect, about 48% of respondents were made worse off by the drug. Among those who were made worse off, their HDRS-17 score was increased by an average of 6.07 points, 20 times the average treatment effect in the opposite direction. Among the 52% of those who were made better off, their HDRS-17 score decreased by an average of 6.29 points. This is a histogram of the treatment effect (the actual assumed effect of the drug, not including mean reversion or the placebo effect):
The overwhelming story here is the heterogeneity, not whether the mean is different from zero. Many patients appear to be helped quite a bit by fluoxetine, and almost as many are harmed.
Conclusion
There are several possibilities that I can think of from this analysis:
- I made a mistake with my simulation
- The simulation is correct but is missing some structural feature of the study that isn't explained in the write-up
- The distribution of depression scores, the distribution of placebo effects, or the distribution of treatment effects is so dissimilar from my assumptions that it is actually possible to get the results that the authors did without having massive heterogeneity of treatment effects
- My simulation and analysis are correct and the treatment effect of fluoxetine is massively heterogeneous.
If the last option is true, then the endpoints of this study, and most other studies on antidepressants in general, are entirely missing the forest for the trees. A small average treatment effect and a difference in binary remission rates are masking the fact that these drugs are highly effective for some and very harmful for others, at least in the short term. Is it possible to determine a priori who will fall in which group? Is the heterogeneity limited to a subset of the questions on the HDRS-17, like sleep or appetite? Do people who benefit from fluoxetine also benefit from duloxetine and other drugs and vice-versa? Without more information about heterogeneity, it seems very difficult to come to any conclusions about the effectiveness of SSRI antidepressants.
References
- Bech P, Cialdella P, Haugh MC, et al. Meta-analysis of randomised controlled trials of fluoxetine v. placebo and tricyclic antidepressants in the short-term treatment of major depression. British Journal of Psychiatry. 2000;176(5):421-428. doi:10.1192/bjp.176.5.421
- Jakobsen JC, Gluud C, Kirsch I. Should antidepressants be used for major depressive disorder? BMJ Evid Based Med. 2020 Aug;25(4):130. doi: 10.1136/bmjebm-2019-111238. Epub 2019 Sep 25. PMID: 31554608; PMCID: PMC7418603.
- Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry 1960; 23:56–62
- Moncrieff J. What does the latest meta-analysis really tell us about antidepressants? Epidemiol Psychiatr Sci. 2018 Oct;27(5):430-432. doi: 10.1017/S2045796018000240. Epub 2018 May 28. PMID: 29804550; PMCID: PMC6999018.
- Rossi A, Barraco A, Donda P. Fluoxetine: a review on evidence based medicine. Ann Gen Hosp Psychiatry. 2004 Feb 12;3(1):2. doi: 10.1186/1475-2832-3-2. PMID: 14962351; PMCID: PMC356924.
1 comments
Comments sorted by top scores.
comment by AnthonyC · 2025-01-13T01:10:18.832Z · LW(p) · GW(p)
If the last option is true, then the endpoints of this study, and most other studies on antidepressants in general, are entirely missing the forest for the trees.
This seems like the most likely scenario to me. I appreciate the need and desire to have a conclusive, defensible, formally rigorous answer to this question, but also, in some sense the answer very obviously has to end up being "Yes, but both 'depression' and 'antidepressant' are probably somewhat confused or overloaded categories that none of our available endpoints or diagnostic metrics demultiplex well enough to satisfy anyone or match patients' lived experiences."
I'm one patient who has been on bupropion for depression for ~6 years, and I truly cannot imagine a study that would convince me that it "didn't work" for me, in the ways that actually matter. The effects on my mind kick in sharply, scale smoothly with dose, decay right in sync with half-life in the body, and are clearly noticeable not just internally for my mood but externally in my speech patterns, reaction speeds, ability to notice things in my surroundings, short term memory, and facial expressions.