Shall we count the living or the dead?
post by Anders_H · 2021-06-14T00:38:22.968Z · LW · GW · 19 comments
This is a linkpost for an entry in my Substack newsletter at https://countheliving.substack.com/p/shall-we-count-the-living-or-the . In this entry, we announce a new arXiv preprint, "Shall we count the living or the dead". This preprint builds on my earlier work that has previously been discussed on Less Wrong [LW · GW], and which led to a response first on Less Wrong [LW · GW] and later in the European Journal of Epidemiology by Carlos Cinelli and Judea Pearl.
The Substack entry contains a link to an animated video on YouTube, which explains a simplified version of the argument.
19 comments
Comments sorted by top scores.
comment by ChristianKl · 2021-06-14T18:06:50.935Z · LW(p) · GW(p)
It would be great if we had a bunch of physicians who practice prediction-based medicine [LW · GW] and could run a trial on whether physicians who use the status quo method are better or worse than those who use your method at predicting side effects for their patients.
While I do appreciate the theoretical argument, it feels like our medical system is deeply flawed when it doesn't provide an easy, practical way to run experiments on whether the technique you propose is helpful for practicing physicians.
comment by George3d6 · 2021-06-14T11:02:33.775Z · LW(p) · GW(p)
This to me seems like a non-issue.
The core problem here is "doctors don't know how to interpret basic probabilities", the solution to this is deregulation in order to hoist the work of decision trees from men.
Discussions like this one are akin to figuring out how to get paedophiles to wear condoms more often: in principle they could be justified if the benefits/costs were proportionally immense, but they are a step in a tangential direction and move focus away from the core issue (which is, again, why your symptoms, traits and preferences are not weighted by a decision tree in order to determine medication).
This more broadly applies to any mathematical computation that is left to the human brain to make instead of being offloaded to a computer. It's literally insane that a small minority of very protectionist professions are still allowed (and indeed, to some extent forced) to do this... it's like accountants being forced to make calculations with pen and paper instead of entering the numbers into a formula in Excel.
Replies from: Anders_H↑ comment by Anders_H · 2021-06-14T13:14:15.916Z · LW(p) · GW(p)
No. This is not about interpretation of probabilities. It is about choosing what aspect of reality to rely on for extrapolation. You will get different extrapolations depending on whether you rely on a risk ratio, a risk difference or an odds ratio. This will lead to real differences in predictions for what happens under intervention.
Even if clinical decisions are entirely left to an algorithm, the algorithm will need to select a mathematical object to rely on for extrapolation. The person who writes the algorithm needs to tell the algorithm what to use, and the answer to that question is contested. This paper contributes to that discussion, and proposes a concrete solution. One that has been known for 65 years, but never used in practice.
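To make the point about scales concrete, here is a minimal sketch of how the three effect measures extrapolate differently. The function name `extrapolate` and the numbers (1% risk untreated and 2% treated in the study, 10% baseline risk in the target group) are illustrative assumptions, not figures taken from the preprint.

```python
def extrapolate(p0_study, p1_study, p0_target):
    """Predict the treated risk in a target group, assuming in turn that the
    risk ratio, the risk difference, or the odds ratio from the study carries
    over unchanged. All names here are hypothetical, for illustration only."""
    def odds(p):
        return p / (1 - p)

    risk_ratio = p1_study / p0_study
    risk_difference = p1_study - p0_study
    odds_ratio = odds(p1_study) / odds(p0_study)

    # Apply the odds ratio on the odds scale, then convert back to a risk.
    target_odds = odds(p0_target) * odds_ratio
    return {
        "risk_ratio": p0_target * risk_ratio,
        "risk_difference": p0_target + risk_difference,
        "odds_ratio": target_odds / (1 + target_odds),
    }


# Illustrative inputs: study risks of 1% (untreated) and 2% (treated),
# target group with a 10% baseline risk.
predictions = extrapolate(0.01, 0.02, 0.10)
for scale, risk in predictions.items():
    print(f"{scale}: {risk:.3f}")
```

The same study data give three different predicted risks for the same patient group, which is why the choice of scale has to be made explicitly, whether by a clinician or by whoever writes the algorithm.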
Replies from: George3d6↑ comment by George3d6 · 2021-06-14T20:02:31.554Z · LW(p) · GW(p)
It is ultimately about interpretation.
This paradigm doesn't matter if the physician has in mind a cost/benefit matrix for the treatment, in which it would be fairly easy to plug in raw experimental data no matter how the researchers chose to analyze it.
More broadly, see the comment by ChristianKl.
Replies from: ChristianKl, Anders_H↑ comment by ChristianKl · 2021-06-14T20:58:52.309Z · LW(p) · GW(p)
Having cost/benefit in mind is not enough. If you don't use a heuristic like the one Anders writes about, you need either causal models or something like prediction-based medicine which gives you a way to decide which of two algorithms for decision making is better by looking at the Brier score (or a similar statistic).
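As a rough sketch of what such a comparison could look like: the Brier score is the mean squared difference between predicted probabilities and the 0/1 outcomes, with lower being better. The outcome and forecast lists below are made-up illustrative data, and `brier_score` is a hypothetical helper name.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    Lower is better; 0.0 is a perfect score."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical data: whether each of five patients had the side effect (1) or not (0),
# and the probabilities two prediction methods assigned beforehand.
outcomes = [0, 0, 1, 0, 1]
method_a = [0.1, 0.2, 0.7, 0.1, 0.6]
method_b = [0.3, 0.3, 0.3, 0.3, 0.3]

score_a = brier_score(method_a, outcomes)
score_b = brier_score(method_b, outcomes)
print(f"method A: {score_a:.3f}, method B: {score_b:.3f}")
```

With enough patients, the method with the consistently lower score is the better predictor, regardless of the theory behind it.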
↑ comment by Anders_H · 2021-06-14T20:18:57.173Z · LW(p) · GW(p)
I very emphatically disagree with this.
You are right that once you have a prediction for risk if untreated, and a prediction for risk if treated, you just need a cost/benefit analysis. However, you won't get to that stage without a paradigm for extrapolation, whether implicit or explicit. I prefer making that paradigm explicit.
If you want to plug in raw experimental data, you are going to need data from people who are exactly like the patient in every way. Then, you will be relying on a paradigm for extrapolation which claims that the conditional counterfactual risks (rather than the magnitude of the effect) can be extrapolated from the study to the patient. It is a different paradigm, and one that can only be justified if the conditioning set includes every cause of the outcome.
In my view, this is completely unrealistic. I prefer a paradigm for extrapolation that aims to extrapolate the scale-specific magnitude of the effect. If this is the goal, our conditioning set only needs to include those covariates that predict the magnitude of the effect of treatment, which is a small subset of all covariates that cause the outcome.
On this specific point, my view is consistent with almost all thinking in medical statistics, with the exception of some very recent work in causal modeling (whose authors prefer the approach based on counterfactual risks). My disagreement with this work in causal modeling is at the core of my last discussion about this on Less Wrong. See for example "Effect Heterogeneity and External Validity in Medicine" [LW · GW] and the European Journal of Epidemiology paper that it links to.
comment by JBlack · 2021-06-15T03:32:15.751Z · LW(p) · GW(p)
There is a serious underlying causal model difference here that cannot be addressed in a purely statistical manner.
The video proposes a model in which some fraction of the population have baseline allergies (yielding a 1% baseline allergic reaction rate over the course of the study), and independently from that the treatment causes allergic reactions in some fraction of people (just over 1% during the same period). If this model is universally correct, then the argument is reasonable.
Do we know that the side effects are independent in this way? It seems to me that we cannot assume this model of independence for every possible combination of treatment and side effect. If there is any positive correlation, then the assumption of independence will yield an extrapolated risk that is too low.
Putting this in terms of causal models, the independence model looks like: Various uncontrolled and unmeasured factors E in the environment sometimes cause some observed side effect S, as observed in the control group. Treatment T also sometimes causes S, regardless of E.
This seems too simplistic. It seems much more likely that E interacts with unknown internal factors I that vary per-person to sometimes cause side effect S in the control group. Treatment T also interacts with I to sometimes produce S.
If you already know that your patient is in a group observed to be at extra risk of S compared with the study population, it is reasonable to infer that this group has a different distribution of I putting them at greater risk of S than the studied group, even if you don't know how much or what the gears-level mechanism is. Since T also interacts with I, it is reasonable as a general principle to expect a greater chance of T causing the side effect in your patient than independence would indicate.
So the question "shall we count the living or the dead" seems misplaced. The real question is to what degree side effects are expected to have common causes or susceptibilities that depend upon persistent factors.
Allergic reactions in particular are known to be not independent, so an example video based on some other side effect may have been better. I can't really think of one though, which does sort of undermine the point of the article.
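The concern above can be illustrated with a toy model in which both the environment E and the treatment T can only cause the side effect in people who carry an internal susceptibility I. Everything here (the function names, the two-level susceptibility, and the specific probabilities) is an assumption chosen for illustration, tuned so that the study shows 1% and 2% risks.

```python
def control_risk(frac_susceptible, p_env):
    """Risk without treatment: only susceptible people can react to the environment."""
    return frac_susceptible * p_env


def treated_risk(frac_susceptible, p_env, p_drug):
    """Risk with treatment: a susceptible person reacts if either the
    environment or the drug triggers a reaction."""
    return frac_susceptible * (1 - (1 - p_env) * (1 - p_drug))


P_ENV = 0.10             # assumed chance that E triggers S in a susceptible person
P_DRUG = 1 - 0.8 / 0.9   # assumed chance that T triggers S, chosen to give 2% in the study

# Study population: 10% of people are susceptible.
study_control = control_risk(0.10, P_ENV)
study_treated = treated_risk(0.10, P_ENV, P_DRUG)

# Target group: everyone is susceptible, so the baseline risk is 10%.
target_control = control_risk(1.0, P_ENV)
target_treated = treated_risk(1.0, P_ENV, P_DRUG)

# Prediction that assumes the drug acts independently of baseline risk
# (the ratio of non-event probabilities is carried over from the study).
independence_prediction = 1 - (1 - target_control) * (1 - study_treated) / (1 - study_control)

print(f"study: {study_control:.3f} -> {study_treated:.3f}")
print(f"target, true risk under this toy model: {target_treated:.3f}")
print(f"target, independence-based prediction:  {independence_prediction:.3f}")
```

In this deliberately extreme case, where the higher baseline risk is explained entirely by susceptibility, the independence-based extrapolation comes out well below the model's true treated risk. Weaker dependence would give a smaller gap.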
Replies from: Anders_H↑ comment by Anders_H · 2021-06-15T10:24:10.205Z · LW(p) · GW(p)
You are correct that someone who has one allergy may be more likely to have another allergy, and that this violates the assumptions of our model. Our model relies on a strong independence assumption, and there are many realistic cases where this independence assumption will not hold. I also agree that the video uses an example where the assumption may not hold. The video is oversimplified on purpose, in an attempt to get people interested enough to read the arXiv preprint.
If there is a small correlation between baseline risk and effect of treatment, this will have a negligible impact on the analysis. If there is a moderate correlation, you will probably be able to bound the true treatment effect using partial identification methods. If there is strong correlation, this may invalidate the analysis completely.
The point we are making is not that the model will always hold exactly. Any model is an approximation. Let's suppose we have three choices:
- Use a template for a causal model that "counts the living", think about all the possible biological reasons that this model could go wrong, represent them in the model if possible, and account for them as best you can in the analysis
- Use a template for a causal model that "counts the dead", think about all the possible biological reasons that this model could go wrong, represent them in the model if possible, and account for them as best you can in the analysis
- Use a model that is invariant to whether you count the living or the dead. This cannot be based on a multiplicative (relative risk) parameter.
The third approach will not be sensitive to the particular problems that I am discussing, but all the suggested methods of this type have their own problems. As I have written earlier, my view is that these problems are more troubling than the problems with the relative risk models.
What we are arguing in this preprint is that if you decide to go with a relative risk model, you should choose between (1) and (2) based on the principles suggested by Sheps, and then reason about problems with this model and how they can be addressed in the analysis, based on the principles that you have correctly outlined in your comment.
I can assure you that if you decide to go with a multiplicative model but choose the wrong "base case", then all of the problems you have discussed in your comments will be orders of magnitude more difficult to deal with in any meaningful way. In other words, it is only after you make the choice recommended by Sheps that it even becomes possible to meaningfully analyze the reasons for deviation from effect homogeneity...
comment by JenniferRM · 2021-06-14T17:37:29.188Z · LW(p) · GW(p)
Is there a pithy summary which explains the basic object level idea in a sentence or four? Like when "Mindel C. Sheps proposed a principled solution", what was it? Is her solution the same as yours, or not?
It kinda seems like the idea is "use causal graphs to formally reason about diagnosis and treatment decisions" but I don't think causal graphs were common in 1958.
Or maybe the idea is that some treatments are like vitamins (maybe help lots of people a little, but could be poison by accident, or at the wrong dose, or for sensitive weirdos) and others are like chemotherapy (terrible in general, and yet plausibly better than non-treatment for very specific diagnoses) and these two kinds of treatment either need different reasoning methods to be properly formally justified, or else need a unified modeling tool that is powerful enough to express the important differences?
Or is the idea something else entirely? I can't seem to skim my way to the critical bit and then feel certain "I got the gist".
Replies from: Anders_H, ChristianKl↑ comment by Anders_H · 2021-06-14T18:45:45.487Z · LW(p) · GW(p)
Suppose you summarize the effect of a drug using a relative risk (a multiplicative effect parameter relating the probability of the event if treated with the probability of the event if untreated), and consider this multiplicative parameter to represent the "magnitude of the effect".
The natural thing for a clinician to do will be to assume that the magnitude of the effect is the same in their own patients. They will therefore rely on this specific scale for extrapolation from the study to their patients. However, those patients may have a different risk profile.
When clinicians do this, they will make different predictions depending on whether the relative risk is based on the probability of the event, or the probability of the complement of the event.
Sheps' solution to this problem is the same as mine: If the intervention results in a decrease to the risk of the outcome, you should use the probability of the event to construct the relative risk, whereas if the intervention increases the risk of the event, you should use the probability of the complement of the event.
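A minimal sketch of this rule, under the assumption that the chosen ratio carries over unchanged from the study to the patient's group. The function name `sheps_extrapolate` and all numbers are hypothetical and only meant to illustrate the arithmetic.

```python
def sheps_extrapolate(p0_study, p1_study, p0_target):
    """Extrapolate a treated risk using the rule described above:
    if treatment lowers the risk, carry over the ratio of event probabilities;
    if it raises the risk, carry over the ratio of non-event probabilities."""
    for p in (p0_study, p1_study, p0_target):
        if not 0 < p < 1:
            raise ValueError("risks must be strictly between 0 and 1")

    if p1_study <= p0_study:
        # Protective (or null) effect: ratio based on the probability of the event.
        return p0_target * (p1_study / p0_study)

    # Harmful effect: ratio based on the probability of the complement of the event.
    complement_ratio = (1 - p1_study) / (1 - p0_study)
    return 1 - (1 - p0_target) * complement_ratio


# Harmful effect: study risks 1% -> 2%, patient group with 10% baseline risk.
harmful = sheps_extrapolate(0.01, 0.02, 0.10)

# Protective effect: study risks 20% -> 10%, patient group with 40% baseline risk.
protective = sheps_extrapolate(0.20, 0.10, 0.40)

# With a harmful effect and a high baseline, the event-based ratio can leave [0, 1],
# while the complement-based ratio cannot.
naive_high_baseline = 0.60 * (0.02 / 0.01)
sheps_high_baseline = sheps_extrapolate(0.01, 0.02, 0.60)

print(f"harmful: {harmful:.3f}, protective: {protective:.3f}")
print(f"60% baseline: naive {naive_high_baseline:.2f}, rule above {sheps_high_baseline:.3f}")
```

One practical consequence of choosing the scale this way is that, whenever the effect runs in the same direction as in the study, the extrapolated value stays a valid probability for any baseline risk.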
Replies from: JenniferRM↑ comment by JenniferRM · 2021-06-14T22:42:30.150Z · LW(p) · GW(p)
Thanks!
My next question was whether this was "just pedagogy and communication to help people avoid dumb calculation mistakes" or a real and substantive issue, and then I watched the video...
And it is nice that the video is only 4:51 long and works "just as well" on 2X...
And... I think basically the claim is NOT just pedagogical, but substantive, but it was hard to notice.
I've swirled your content around, and in doing so I feel like I've removed the stats confusion and turned it around so it sharpens the way the core question is about physical reality and modeling intuitions... Here is an alternative abstract that I offer for your use (as is, or with edits) if you want it <3
Imagine that an enormous high powered study samples people in general, and 1% of the control group has allergies on placebo, while 2% on "the real drug" have an allergic reaction. Then a specific patient from a subpopulation where 10% of the people have separately been measured to be predisposed to allergies, comes to a doctor who then tries to weigh treatment benefits vs complication risks. Assume the doctor is rushed, and can't do a proper job (or perhaps lacks a relevant rapid test kit and/or lacks the ability to construct tests from first principles because of a brutally restrictive regulatory environment in medicine) and so can only go off of subpopulation data without measuring the patient for direct mechanistic pre-disposing allergy factors. What is the maximally structurally plausible probability of an allergic reaction, as a complication for that patient, in response to treatment: ~2% or ~11% or ~20%? This fact of the matter, at this clinical decision point, could itself be measured empirically. Textbooks that speak to this say ~20%, but those textbooks are wrong because they have no theory of external reality and are basically just cargo-culting bad practices that have been bad for over half a century. The simplest Pearlian causal model (see figure 1) should be preferred based on Occam's razor and says ~11%. Mindel C. Sheps proposed the right answer using related intuitions in 1958 but she has been ignored (and sometimes made fun of) because our field is systematically bad at reasoning thoughtfully about physical reality. This paper aims to correct this defect in how the field reasons.
Does this seem like a friendly proposal (friendly to you, not friendly to the field, of course) for a better abstract, that focuses on a concrete example while pushing as hard as possible on the central substantive issue?
I admit: peer reviewers would probably object to this.
Also I did intentionally "punch it up" more than might be justified in hopes that you'll object in an informative way here and now. My abstract is PROBABLY WRONG (one example: I know your figure 1 is not a Pearlian model) but I hope it is wrong in the way a bold sketch is less accurate than a more painterly depiction, while still being useful to figure out what the central message can or should be.
Replies from: Anders_H, ChristianKl↑ comment by Anders_H · 2021-06-15T10:35:11.350Z · LW(p) · GW(p)
Thank you so much for writing this! Yes, this is mostly an accurate summary of my views (although I would certainly phrase some things differently). I just want to point out two minor disagreements:
- I don't think the problem is that doctors are too rushed to do a proper job. I think the patient-specific data that you would need is in many cases theoretically unobservable, or at least that we would need a much more complete understanding of biological mechanisms in order to know what to test the patients for in order to make a truly individualized decision. At least for the foreseeable future, I think it will be impossible for doctors to determine which patients will benefit on an individual level. They will be constrained to using the patient's observables to put them in a reference group, and then using that reference group to predict risk based on observations from other patients in the same reference group.
- I am not entirely convinced that the Pearlian approach is the most natural way to handle this. In the manuscript, I use "modern causal models" as a more general term that also includes other types of counterfactual causal models. Of course, all these models are basically isomorphic, and Cinelli/Pearl did show in response to my last paper that it is possible to do the same thing using DAGs. I am just not at all convinced that the easiest way to capture the relevant intuition is to use Pearl's graphical representation of the causal models.
↑ comment by ChristianKl · 2021-06-15T00:37:09.919Z · LW(p) · GW(p)
Assume the doctor is rushed, and can't do a proper job (or perhaps lacks a relevant rapid test kit and/or lacks the ability to construct tests from first principles because of a brutally restrictive regulatory environment in medicine) and so can only go off of subpopulation data without measuring the patient for direct mechanistic pre-disposing allergy factors
Even if the doctor could run all the tests they desire on the patient, the original study that said 1% of the control group and 2% on the real drug does not contain information about what pre-disposing allergy factors the patients in the trial had.
Replies from: JenniferRM↑ comment by JenniferRM · 2021-06-15T08:16:53.836Z · LW(p) · GW(p)
The official study is neither the beginning nor the end of knowledge. If people were being really competent and thorough, the study could have collected all kinds of additional patient metadata.
The patient's body is made of atoms that move according to physical laws. It is basically a machine. With the correct mechanistic modeling (possibly very very complicated) grounded in various possible measurements (some simple, some maybe more complicated) all motions of the atoms of the body are potentially subject to scientific mastery and intentional control.
From patient to patient, there are commonalities. Places where things work the same. This allows shortcuts... transfer of predictions from one patient to another.
Since the body operates as it does for physical reasons, if a patient had a unique arrangement of atoms, that could produce a unique medical situation...
...and yet the unique medical situation will still obey the laws of physics and chemistry and biochemistry and so on. From such models, with lots of data, one could still hypothetically be very very confident even about how to treat a VERY unique organism.
Veterinarians tend to be better at first principles medicine than mere human doctors. There are fewer vet jobs, and fewer vet schools, and helping animals has more of a prestige halo among undergrads than helping humans, and the school applications are more competitive, and the domain itself is vastly larger, so more generic reasoning tends to be taught and learned and used.
If a single human doctor was smart and competent and thorough, they could have calibrated hunches about what tests the doctors who ran the "1% and 2% study" COULD have performed.
If a single doctor was smart and competent and thorough, they could look at the study that said "overall in humans in general in a large group: side effect X was 1% in controls and 2% with the real drug" AND they could sequence the entire genome of the patient and make predictions from this sequence data. The two kinds of data could potentially be reconciled and used together for the specific patient.
BUT, if a single doctor was smart and competent and thorough, they could ALSO (perhaps) have direct access to the list of all allergic reactions the patient is capable of generating because they directly sampled the antibodies in the patient, and now have a computerized report of that entire dataset and what it means.
Heck, with AlphaFold in the pipeline now, a hypothetical efficacy study could hypothetically have sequenced every STUDY patient, and every patient's unique gene sequences and unique drug-target-folding could be predicted.
A study output might not be "effective or not" but rather just be a large computer model where the model can take any plausible human biodata package and say which reactions (good, bad, or interesting) the drug would have for the specific person with 99.9% confidence one way or the other.
Drugs aren't magic essences. Their "non-100%" efficacy rates are not ontologically immutable facts. Our current "it might work, it might not" summaries of drug effects... are caused partly by our tolerance for ignorance, rather than only by the drug's intrinsically random behavior.
We can model a drug as a magic fetish the patient puts in their mouth, and which sometimes works and sometimes doesn't, as a brute fact, characterized only in terms of central tendencies...
...but this modeling approach is based on our limits, which are not set in stone.
Science is not over. Our doctors are still basically like witch doctors compared to the theoretical limits imposed by the laws of physics.
The current barriers to good medical treatment are strongly related to how much time and effort it takes to talk to people and follow up and measure things... and thus they are related to wealth, and thus economics, and thus economic regulatory systems.
Our government and universities are bad, and so our medical regulations are bad, and so our medicine is bad. It is not against the laws of physics for medicine to be better than this.
Concretely: do you have a physical/scientific hunch here? It kinda seems like you're advocating "2% because that's what the study said"?
What is the maximally structurally plausible probability of an allergic reaction, as a complication for that patient, in response to treatment: ~2% or ~11% or ~20%?
Replies from: ChristianKl, Anders_H
↑ comment by ChristianKl · 2021-06-15T08:50:09.784Z · LW(p) · GW(p)
The patient's body is made of atoms that move according to physical laws.
Yes, but making treatment decisions based on pathophysiological theories goes counter to what evidence-based medicine is about. The idea of this method is that it's going to be used by doctors practicing evidence-based medicine.
You can argue that evidence-based medicine is a flawed paradigm and doctors should instead practice physical-law-based medicine (or whatever you want to call it) but that's a more general discussion than the one about this particular heuristic.
↑ comment by Anders_H · 2021-06-15T12:25:54.404Z · LW(p) · GW(p)
This comment touches on the central tension between the current paradigm in medicine, i.e. "evidence-based medicine" and an alternative and intuitively appealing approach based on a biological understanding of mechanism of disease.
In evidence-based medicine, decisions are based on statistical analysis of randomized trials; what matters is whether we can be confident that the medication probabilistically has improved outcomes when tested on humans as a unit. We don't really care too much about the mechanism behind the causal effect, just whether we can be sure it is real.
The exaggerated strawman alternative approach to EBM would be Star Trek medicine, where the ship's doctor can reliably scan an alien's biology, determine which molecule is needed to correct the pathology, synthesize that molecule and administer it as treatment.
If we have a complete understanding of what Nancy Cartwright calls "the nomological machine", Star Trek medicine should work in theory. However, you are going to need a very complete, accurate and detailed map of the human body to make it work. Given the complexity of the human body, I think we are very far from being able to do this in practice.
There have been many cases in recent history where doctors believed they understood biology well enough to predict the consequences, yet were proved wrong by randomized trials. See for example Vinay Prasad's book "Ending Medical Reversal".
My personal view is that we are very far from being able to ground clinical decisions in mechanistic knowledge instead of randomized trials. Trying to do so would probably be dangerous given the current state of biological understanding. However, we can probably improve on naive evidence-based medicine by carving out a role for mechanistic knowledge to complement data analysis. Mechanisms seem particularly important for reasoning correctly about extrapolation; the purpose of my research program is to clarify one way such mechanisms can be used. It doesn't always work perfectly, but I am not aware of any examples where an alternative approach works better.
↑ comment by ChristianKl · 2021-06-14T17:57:02.318Z · LW(p) · GW(p)
Is there a pithy summary which explains the basic object level idea in a sentence or four? Like when "Mindel C. Sheps proposed a principled solution", what was it? Is her solution the same as yours, or not?
Have you watched the video? To me it seems like a clear presentation of the basic idea.
It kinda seems like the idea is "use causal graphs to formally reason about diagnosis and treatment decisions" but I don't think causal graphs were common in 1958.
No, the causal graphs are used to justify that the heuristic proposed in the video is better than the status quo.
Replies from: Ericf↑ comment by Ericf · 2021-06-14T18:21:26.695Z · LW(p) · GW(p)
Asking someone to watch a video is rude and filters your audience to "people with enough time to consume content slowly, and an environment that allows audio/streaming".
Replies from: ChristianKl↑ comment by ChristianKl · 2021-06-14T18:49:32.920Z · LW(p) · GW(p)
"people with enough time to consume content slowly"
Understanding a concept from reading can sometimes take longer than watching a short video that was created to concisely explain the concept.