SARS-CoV-2 pool-testing algorithm puzzle

post by Steven Byrnes (steve2152) · 2020-03-20T13:22:44.121Z · LW · GW · 18 comments

This is a question post.


According to this article, it's possible to test a pool of up to (at least) 64 people for SARS-CoV-2 at once, and the test will say whether at least one person in the pool is positive. So, here's the puzzle:

We have a large number N of people that need to be tested. Each has SARS-CoV-2 with probability P. (The cases of interest are P<5%, maybe even P<1%, see comment [LW(p) · GW(p)].) We can test any subset of the N (or maybe up to a maximum pool size S), with the test returning positive if at least one person in the subset is positive. (Bonus: allow that the tests are imperfect, with a per-test false positive rate FPR and false negative rate FNR.)

Your mission is to develop a testing protocol, i.e. an algorithm for what tests to run in what order. The simplest example (proposed in that article linked above) would be: Test non-overlapping groups, and if the test is positive, test each individual within the group. But is there a better way?

Score your algorithm's success on some combination of: (1) minimal total expected number of tests, (2) trying not to ask people for too many samples (lower priority on this one I think[1]), (3) false positive and false negative rates for individuals, (4) minimal serial time (i.e., minimal number of steps where we need results from step k to figure out what tests to run in step k+1).

(I'm not sure how to combine these criteria into a single score that reflects practical goals; if anyone knows, please comment. Or tell me if I'm leaving something out.)
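
To make criteria (1) and (4) concrete, here is a minimal simulation sketch (not part of the original post) that scores the simple two-stage protocol described above, assuming a perfect test; the function name and parameters are just illustrative:

```python
import random

def simulate_two_stage(n_people, p, pool_size, n_trials=2_000):
    """Monte Carlo estimate of tests-per-person for the simple protocol:
    test disjoint pools, then retest each member of any positive pool."""
    total_tests = 0
    for _ in range(n_trials):
        statuses = [random.random() < p for _ in range(n_people)]
        for start in range(0, n_people, pool_size):
            pool = statuses[start:start + pool_size]
            total_tests += 1              # one pooled test per group
            if any(pool):                 # positive pool: test each member
                total_tests += len(pool)
    return {
        "tests_per_person": total_tests / (n_trials * n_people),
        "serial_rounds": 2,               # pooled round, then individual round
    }

print(simulate_two_stage(n_people=1000, p=0.01, pool_size=32))
# With p = 1% and pools of 32, roughly 0.3 tests per person instead of 1.
```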

Bonus: What if we have prior information, in the form of a N×N correlation matrix? (This could reflect the fact that for cohabitants / colleagues / neighbors, it's likelier that both or neither have it.)


  1. The swabbing process is mildly unpleasant, so other things equal, better to not require lots of swabs. I don't know whether or not it's possible to split the sample from one swab and put it into multiple test-pools. ↩︎

Answers

answer by John_Maxwell · 2020-03-22T06:59:18.196Z · LW(p) · GW(p)

This is a great question. However, I think we need to know more about the properties of the test before the problem is fully specified: What are the false positive and false negative rates, and how do these numbers change in the pool-testing case?

This FAQ on the CDC website links to this pdf on the FDA's website. I don't have the background to understand most of this, but I pasted some snippets that seem interesting below. Note that it appears this info only applies to a particular CDC-endorsed test being used in the US--I don't know what's in widespread use internationally.

Negative results do not preclude 2019-nCoV infection and should not be used as the sole basis for treatment or other patient management decisions. Negative results must be combined with clinical observations, patient history, and epidemiological information.

False negative rate could be high.

2019-nCoV Positive Control (nCoVPC)

For use as a positive control with the CDC 2019-nCoV Real-Time RT-PCR Diagnostic Panel procedure.

It sounds as though the standard test kit includes 4 test tubes of noninfectious material that should produce a positive test result if things are working properly. This could give a general sense of what dilution levels the test continues to be accurate at.

Page 32 of the PDF has a section labelled "2019-nCoV Markers (N1 and N2)" which keeps referencing "threshold lines". This makes me think that although the test is advertised as providing a discrete positive/negative outcome, it really gives a continuous output along with guidelines for discretization (but in theory this continuous output could be translated into a probabilistic assessment instead, if we had the necessary data and some theoretical understanding of what's going on).

Additionally, it might be possible to make a vague guess about the likely number of infected individuals in a pool as a function of these continuous parameters, esp. if combined with info re: where each is in their disease progression (see time series note below).

There's a "2019-nCoV rRT-PCR Diagnostic Panel Results Interpretation Guide" section which discusses the possibility of inconclusive results. However, I think Table 9 on p. 41 of the PDF weakly suggests that they are rare.

From the "Limitations" section of the PDF:

Optimum specimen types and timing for peak viral levels during infections caused by 2019-nCoV have not been determined. Collection of multiple specimens (types and time points) from the same patient may be necessary to detect the virus.

So now the problem has a time series aspect. Things keep getting more and more interesting ;)

A false negative result may occur if a specimen is improperly collected, transported or handled. False negative results may also occur if amplification inhibitors are present in the specimen or if inadequate numbers of organisms are present in the specimen.

So now we have to worry about whether each individual testee might have "amplification inhibitors" which screw up the entire batch? Hey CDC, are you sure there's no way to do some kind of control amplification, to check if amplification is working as intended regardless of the specimen's 2019-nCoV status?

Positive and negative predictive values are highly dependent on prevalence. False negative test results are more likely when prevalence of disease is high. False positive test results are more likely when prevalence is moderate to low.

"Priors are a thing." Seems obvious, but if someone is going to create a web app that makes it easy to conduct pool tests, it should probably ask the user what the prevalence of 2019- nCoV is in the local population. You don't want a user to blindly trust a prediction made based on an inaccurate base rate.

If the virus mutates in the rRT-PCR target region, 2019-nCoV may not be detected or may be detected less predictably. Inhibitors or other types of interference may produce a false negative result. An interference study evaluating the effect of common cold medications was not performed.

Good to know.

Performance Characteristics

This section looks pretty relevant.

LoD studies determine the lowest detectable concentration of 2019-nCoV at which approximately 95% of all (true positive) replicates test positive. The LoD was determined by limiting dilution studies using characterized samples.

...

Tables 4 & 5 on page 37 of the PDF look very interesting, displaying test sensitivity as a function of RNA copies. I'm not sure how to interpret the "Mean Ct" row, or what it means for a dilution to be "≥ 95% positive". I'm also not sure why there are subtables for 2019-nCoV_N1 and 2019-nCoV_N2 (it looks like they represent two different test markers?)

I also don't know what a typical number of RNA copies would be in an infectious specimen.

Tables 4 & 5 make me a bit pessimistic about this entire project, because they imply that a factor of 10 dilution (from roughly 3 RNA copies per μL to roughly 1 RNA copy per 3 μL) reduces test sensitivity from ~100% to ~50%. Then again, it's possible that these numbers are chosen because the CDC wanted to figure out the precise range that the test broke down in, and in a real-world scenario, an infectious specimen is likely to have way more than 3 RNA copies per μL (stands to reason, doesn't it?)

Anyway, the best approach may be an algorithm which takes conditional probabilities (and maybe also prior information about local prevalence) as inputs, so the algorithm can be run with minimal tweaks as more data is gathered regarding the real false positive/false negative rates, and can be used with any test whose performance characteristics are known.

An alternate frame is instead of thinking of each testee as being either positive or negative for the virus, think of each testee as having some number of 2019-nCoV RNA copies per μL of specimen (0 if they're negative for 2019-nCoV, presumably!)

Through this lens, you could see the problem as an active learning problem where the goal is to learn a regression coefficient for the 2019-nCoV RNA concentration for each testee. If you're using a clever algorithm powered by a good noise model, you will probably sometimes find yourself testing specimens which aren't an even mix of different testee specimens (example: we think Patient A might have a very small number of RNA copies per μL, so we test a specimen which is 50% from Patient A but 10% from Patients B/C/D/E/F because that gives us more information).
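
As a very rough sketch of the recovery step in that regression framing (not the active-learning part), if the pooled readout really were roughly linear in concentration, one could estimate per-person concentrations from unevenly mixed pools with nonnegative least squares. The mixing proportions, noise level, and pool count below are made up for illustration, and with fewer pools than people the recovery is only reliable when the answer is sparse (which is the compressed-sensing territory discussed in other answers):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

n_people, n_pools = 12, 6
true_conc = np.zeros(n_people)
true_conc[3] = 50.0                      # one infected person, 50 RNA copies/uL (made-up number)

# Each row: the fraction of each person's specimen that goes into that pooled sample.
mix = rng.dirichlet(np.ones(n_people), size=n_pools)

# Measured pooled concentrations, with some measurement noise.
measured = mix @ true_conc + rng.normal(scale=0.5, size=n_pools)

estimate, _ = nnls(mix, measured)        # nonnegative least squares fit
print(np.round(estimate, 1))             # ideally this peaks at index 3
```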

Side note: In the course of this research, I happened to notice this note on the CDC website:

I believe that I have found a treatment or vaccine for COVID-19. Is CDC the best place to submit my idea?

BARDA is providing a portal to support U.S. government medical countermeasure research and development. Interested stakeholders can learn more here.

That could be a good link to use once we think we have something which is good enough to share with the US government.

EDIT: If anyone can put me in contact with someone who's actually performed these tests, and ideally also has some knowledge of how they work under the hood, I would love that. Active learning is a topic I'm very interested in. I can program the app if you can answer all of my questions about the testing process. Send me a message via my user page with your contact info.

comment by John_Maxwell (John_Maxwell_IV) · 2020-03-23T20:54:44.769Z · LW(p) · GW(p)

I set up a calendar for medical professionals to book me here:

https://calendly.com/book-john/discuss-covid-19-testing

answer by jacobm · 2020-03-23T23:32:38.644Z · LW(p) · GW(p)

This article provides a helpful look at optimal pooling strategies: https://arxiv.org/pdf/1007.4903.pdf

"We show that if the prevalence of positive samples is greater than 30% it is never worth pooling. From 30% down to 1% pools of size 4 are close to optimal. Below 1% substantial gains can be made by pooling, especially if the samples are pooled twice. However, with large pools the sensitivity of the test will fall correspondingly and this must be taken into consideration. We derive simple expressions for the optimal pool size and for the corresponding proportion of samples tested."

comment by John_Maxwell (John_Maxwell_IV) · 2020-03-24T04:15:40.258Z · LW(p) · GW(p)

I think it may be possible to do even better than the algorithm in this paper if your 2nd round of pooling is allowed to cut across multiple pools from your first round.

answer by Zack_M_Davis · 2020-03-22T05:19:13.967Z · LW(p) · GW(p)

This reminds me of that cute information-theory puzzle about finding the ball with the different weight! I'm pretty dumb and bad at math, but I think the way this works is that since each test is a yes-or-no question, we can reduce our uncertainty by at most one bit with each test, and, as in Twenty Questions, we want to choose the question such that we get that bit.

A start on the simplest variant of the problem, where we're assuming the tests are perfect and just trying to minimize the number of tests: the probability of at least one person in the pool having the 'roni in pool size S is going to be 1 - (1 - P)^S, the complement of the probability of them all not having it. We want to choose S such that this quantity is close to 0.5, so S = ln(0.5)/ln(1 - P). (For example, if P = 0.05, then S ≈ 13.51.)

My next thought is that when we do get a positive test with this group size, we should keep halving that group to find out which person is positive—but that would only work with certainty if there were exactly one positive (if the "hit" is in one half of the group, you know it's not in the other), which isn't necessarily the case (we could have gotten "lucky" and got more than one hit in our group that was chosen to have a fifty-fifty shot of having at least one) ...

Partial credit???
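
One way to keep the halving idea correct even when a pool contains several positives is to test both halves of any positive pool and recurse into whichever halves come back positive. A minimal sketch of that adaptive scheme, assuming a perfect test (it's not optimal; e.g. when the left half of a positive pool tests negative you could deduce the right half is positive and skip one test, but it is correct):

```python
def binary_split(statuses, tests):
    """Return indices of positives in `statuses`, counting pooled tests in `tests`.
    A 'pooled test' on a sublist is simulated here with any()."""
    tests[0] += 1
    if not any(statuses):          # pool negative: everyone in it is cleared
        return []
    if len(statuses) == 1:         # pool of one: that person is positive
        return [0]
    mid = len(statuses) // 2
    left = binary_split(statuses[:mid], tests)
    right = binary_split(statuses[mid:], tests)
    return left + [mid + i for i in right]

statuses = [False] * 40
statuses[7] = statuses[23] = True
tests = [0]
print(binary_split(statuses, tests), "found using", tests[0], "pooled tests")
```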

answer by Ben Champion · 2020-03-22T08:27:48.398Z · LW(p) · GW(p)

[EDIT:

I see the monograph in gwern's post below [LW(p) · GW(p)] makes the observation that group testing is a Boolean version of compressed sensing and references Gilbert et al. 2008 which I'm just reading - found a freely available version on Google Scholar. And more importantly, the monograph contains algorithms with similar bounds to CS.

READ GWERN'S POST BELOW [LW(p) · GW(p)] FIRST

]

Some ideas...

This can be seen as a nonlinear version of a compressed sensing problem (see here for a good "25 minute" intro to compressed sensing, or the Wikipedia article).

The nonlinearity comes from the fact that the pooled test seems to behave like an approximate OR-gate: if any sample is positive the pooled sample will be positive, otherwise the pooled sample will test negative. I'm not sure how to deal with that.

The vector to be "reconstructed" is x ∈ {0,1}^N, where x_i = 1 if person i has the virus and x_i = 0 otherwise.

(We can replace "has" with "is deemed by this test to have", obviously there will be an error rate)

Let our "sampling matrix" be

Then our sample results are y_i = ∨_j (A_ij ∧ x_j), where ∨ denotes the sample-taking operation; I'm using the OR symbol here assuming that's approximately how it works.

If instead the sample response is linear and we rescale so that the sample responses count the number of positives in each sample, we have y = Ax.

The compressed sensing idea is, if you choose the rows (sampling patterns) to be random and incoherent (meaning, very roughly, not too similar), you only need roughly k log(N/k) samples to reconstruct x with "high" probability, where k is the number of nonzeros in x (i.e. the number of positives). (All this hand-waving is tightened up in the literature). In other words, we only need to do roughly k log(N/k) pooled samples. The method is to solve: minimize ||x||_1 subject to Ax = y.

There are known algorithms for doing this (see the 25-minute intro), e.g. any linear programming (LP) algorithm. I'm wondering if there is a way to encode the OR-gate behaviour as LP (strictly speaking, mixed-integer programming, MIP) constraints. Surely the results are different in terms of the required number of pooled tests and computational efficiency... but worth trying I think.
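
For the OR-gate (Boolean) version there is at least one very simple decoder, sometimes called COMP in the group-testing literature: anyone who appears in at least one negative pool is declared negative, and everyone else is declared (possibly falsely) positive. A sketch with a random pooling matrix, assuming perfect tests:

```python
import numpy as np

rng = np.random.default_rng(1)

N, m, k = 200, 40, 3                       # people, pooled tests, true positives
x = np.zeros(N, dtype=bool)
x[rng.choice(N, size=k, replace=False)] = True

A = rng.random((m, N)) < 0.1               # A[i, j]: person j is in pool i
y = (A & x).any(axis=1)                    # OR measurement for each pool

# COMP: you are cleared if you appear in any pool that tested negative.
cleared = (A & ~y[:, None]).any(axis=0)
declared_positive = ~cleared

print("true positives:    ", np.flatnonzero(x))
print("declared positive: ", np.flatnonzero(declared_positive))
# With a perfect test every true positive is always declared positive; the only
# possible errors are false positives (people who never landed in a negative pool).
```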

Some thought still required on how the test false-positive, false-negative, true-positive, true-negative rates propagate through the model.

Well-known researchers in this area are Terence Tao, Emmanuel Candès, and David Donoho ... there are undoubtedly many others but (clearly!) I don't have extensive knowledge of the literature.

The connection with gwern's post below [LW(p) · GW(p)] is that another way of looking at compressed sensing is "sub-Nyquist/Shannon sampling for sparse signals" - the information theory connection again. [EDIT: and the algorithms there seem to achieve similar performance bounds]

If the maximum subset size S<N then I think you can just do batch processing of batches of size S?

answer by Gurkenglas · 2020-03-21T01:10:38.330Z · LW(p) · GW(p)

Test random overlapping groups, then logically deduce who isn't infected and how likely each remaining person is to be infected. Tune group size and test count using simulations on generated data. I intuit that serial tests gain little unless P is << 1/64. In that case, test non-overlapping groups, then run the non-serial protocol on everyone who was in a yes-group - within those, we can update P to >= 1/64.

18 comments


comment by gwern · 2020-03-20T17:13:51.811Z · LW(p) · GW(p)

If you want to get really hardcore in reading up on pool/group testing, there's a recent monograph: "Group Testing: An Information Theory Perspective: chapter 3: Algorithms for Noisy Group Testing", Aldridge et al 2019.

Replies from: BenChampion
comment by Ben Champion (BenChampion) · 2020-03-22T17:43:54.051Z · LW(p) · GW(p)

Would you be able to post this under answers? It seems to describe algorithms and their performance bounds in quite some detail and definitely deserves to be the top answer.

Replies from: gwern
comment by gwern · 2020-03-22T14:33:57.103Z · LW(p) · GW(p)

It describes a lot of things, and I'm not sure which ones would be the right answer here. If anyone wants to read through and describe which one seems to best fit the corona problem, that'd be a better answer than just providing an enormous difficult-to-read monograph.

comment by Dagon · 2020-03-20T16:45:36.240Z · LW(p) · GW(p)

Interesting idea, which I hope the testing centers are using (but I hadn't heard of it, so maybe not!). The prior is very important: initially it's just the likelihood of a positive for each sample, but correlations can help find optimal groupings. That gives us a probability of a positive for any arbitrary group. Bayes can tell us how to update the individual probabilities when a group tests positive.

I haven't written the simulator, but my expectation is that if most of the samples have a significant likelihood of being positive, you don't gain much by grouping - if you try binary search, you'll waste more on the group tests than you gain by skipping individual tests when a group is negative. If positives are unlikely, you should be able to maximize information by grouping such that there's a 50% chance the group tests negative. Eliminate all the negatives at once, then regroup based on the new probabilities (posterior of this test, prior for the next) and repeat.

Eventually, you get down to individuals with 50% likelihood, and you kind of have to test each. Or declare that you treat 50% (or 30% or some other likelihood) as "good enough to treat", and skip the tests that won't affect treatment.
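
As a sketch of the update step described here, assuming a perfect test and independent priors (correlations would need a joint model): a negative pool clears everyone in it, and if a pool tests positive, each member's posterior is their prior divided by the probability that the pool contains at least one positive.

```python
import numpy as np

def update_on_group_result(priors, positive):
    """Posterior P(infected) for each group member after a perfect pooled test.
    priors: independent prior probabilities for the group's members."""
    priors = np.asarray(priors, dtype=float)
    if not positive:
        return np.zeros_like(priors)           # a negative pool clears everyone in it
    p_group_positive = 1 - np.prod(1 - priors)
    return priors / p_group_positive           # P(i positive | at least one positive)

print(update_on_group_result([0.02, 0.02, 0.05, 0.10], positive=True))
# -> roughly [0.11, 0.11, 0.28, 0.56]: everyone's probability goes up,
#    with the person who had the highest prior going up the most.
```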

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2020-03-21T00:59:11.724Z · LW(p) · GW(p)

Thanks! Interesting thoughts.

I agree that if P≥50% then pooling is likely useless. We eventually want to be doing things like testing everyone who has come anywhere close to a known case, and/or testing absolutely everyone who has a fever, things like that. So if we do things right, we're eventually hoping to be at P<5%, or maybe even P<<1% in the longer term. South Korea is commendably aggressive in testing, and they're at P<5%, or something like that.

comment by Yoav Ravid · 2020-03-21T19:54:12.421Z · LW(p) · GW(p)

Even without finding the exact person who's sick, this can go along very well with community-based approaches (as described here for example), where you try to find communities that have infections and isolate them from other communities to stop the spread from getting everywhere. It's especially helpful when tests are severely scarce.

comment by Steven Byrnes (steve2152) · 2020-03-21T21:57:46.831Z · LW(p) · GW(p)

FYI, there's some discussion at this blog post

comment by Peter Wildeford (peter_hurford) · 2020-03-21T04:38:09.360Z · LW(p) · GW(p)

What about binary search?

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2020-03-21T10:23:48.764Z · LW(p) · GW(p)

Don't forget there could be many positives per pool...

Replies from: peter_hurford
comment by Peter Wildeford (peter_hurford) · 2020-03-21T13:56:09.382Z · LW(p) · GW(p)

A binary search strategy still could be more efficient, depending on the ratio of positives to negatives.

comment by jimrandomh · 2020-03-20T18:19:16.943Z · LW(p) · GW(p)

This can only be used on groups where everyone is asymptomatic, and there will be low limits on the pool size even then.

The first step of a PCR test is RNA amplification; you use enzymes which take a small amount of RNA in the sample, and produce a large number of copies. The problem is that there are other RNA viruses besides SARS-CoV-2, such as influenza, and depending when in the disease course the samples were taken, the amount of irrelevant RNA might exceed the amount of SARS-CoV-2 RNA by orders of magnitude, which would lead to a false negative.

Replies from: steve2152, JessRiedel
comment by Steven Byrnes (steve2152) · 2020-03-20T18:39:16.420Z · LW(p) · GW(p)

Did you read the link at the top? It says it works for 64 people, right? Is that article unreliable or misleading or something?

Replies from: jimrandomh
comment by jimrandomh · 2020-03-21T06:23:20.857Z · LW(p) · GW(p)

It's a newspaper article based on an unpublished paper; that reference class of writing can't be trusted to report the caveats.

(I could be wrong about the mechanics of PCR; I'm not an expert in it; but the article itself doesn't provide much information about that.)

Replies from: johnyaku, steve2152, John_Maxwell_IV
comment by johnyaku · 2020-04-14T11:05:04.186Z · LW(p) · GW(p)

Pre-print of the paper with details of the methods, etc: https://www.medrxiv.org/content/10.1101/2020.03.26.20039438v1.full.pdf

Similar methodology, but with pools of 10:

https://www.medrxiv.org/content/10.1101/2020.03.30.20043513v1.full.pdf

Peer-reviewed paper with pools of up to 96 samples for avian influenza:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2878216/

None of this is surprising given that...

1) The threshold for detecting the (very similar) SARS virus was 37~4000 copies per mL (depending on the platform -- this suggests that choice of platform will be critical for this to work well) https://jcm.asm.org/content/42/5/2094

2) Patients with COVID-19 generally have 10,000 to 1,000,000 copies per swab. Not sure how many mL per swab, but not many I suspect: https://www.nature.com/articles/s41586-020-2196-x_reference.pdf

3) PCR uses primers to selectively amplify only the target RNA. The presence of other RNA is not really a problem

...

Put it all together and here is the argument: https://www.medrxiv.org/content/10.1101/2020.03.27.20043968v1.full.pdf
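
Rough arithmetic tying points 1 and 2 together (the mL-per-swab figure below is a guess, since the sources don't give one):

```python
# Back-of-envelope check that a 64-fold dilution can stay above the detection threshold.
copies_per_swab = (10_000, 1_000_000)     # typical range from point 2
ml_per_swab = 3.0                         # ASSUMED elution volume, not from the sources
pool_size = 64

for copies in copies_per_swab:
    conc = copies / ml_per_swab           # copies/mL in the undiluted specimen
    pooled = conc / pool_size             # equal-volume pooling dilutes 64-fold
    print(f"{copies:>9,} copies/swab -> ~{pooled:,.0f} copies/mL in a 64-sample pool")
# Compare with the 37-4,000 copies/mL detection thresholds from point 1:
# the low end is marginal, the high end is comfortably detectable.
```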

comment by Steven Byrnes (steve2152) · 2020-03-21T20:08:35.537Z · LW(p) · GW(p)

I certainly agree that there's a very, very good chance that no lives will be saved even if we do a great job quantifying the potential benefits of test-pooling and disseminating that information. But I still think it's an activity with a sufficiently high expected value as to be worth someone's time to do ... compared to other ways that theoretical CS types could be spending their time to "help the war effort" :-)

comment by John_Maxwell (John_Maxwell_IV) · 2020-03-24T02:58:22.167Z · LW(p) · GW(p)

This appears to be the university press release, if you consider that to be more reliable:

According to Prof. Roy Kishony, head of the research group in the Faculty of Biology at Technion, “This is not a scientific breakthrough, but a demonstration of the effectivity of using the existing method and even the existing equipment to significantly increase the volume of samples tested per day. This is done by pooling multiple samples in a single test tube. Even when we conducted a joint examination of 64 samples in which only one was a positive carrier, the system identified that there was a positive sample. Although there are some logistical challenges in implementing the method, we expect that it will greatly increase the volume of samples tested per day so that we can identify the asymptomatic carriers. This approach should reduce the chance of infection and flatten the infection curve.”

One hopes that the university wouldn't misquote one of their own professors in their own press release :)

comment by JessRiedel · 2020-03-21T19:26:47.553Z · LW(p) · GW(p)

The problem is that there are other RNA viruses besides SARS-CoV-2, such as influenza, and depending when in the disease course the samples were taken, the amount of irrelevant RNA might exceed the amount of SARS-CoV-2 RNA by orders of magnitude

There is going to be tons of RNA in saliva from sources besides SARS-CoV-2 always. Bits of RNA are floating around everywhere. Yes, there is some minimum threshold of SARS-CoV-2 density at which the test will fail to detect it, but this should just scale up by a factor of N when pooling over N people. I don't see why other RNA those people have will be a problem any more than the other sources of RNA in a single person are a problem for a non-pooled test.

comment by johnyaku · 2020-03-21T19:02:53.847Z · LW(p) · GW(p)

Great puzzle. Bear in mind the asymmetry between false positives and false negatives in this case. For false positives (positive result but not actually carrying the virus) we recommend that they self-isolate for two weeks and monitor symptoms. They lose two weeks of freedom of movement/interaction, but are otherwise unharmed (and relatively protected from new infections during this period). A false negative, on the other hand, can go on to infect other people, even if they remain asymptomatic. Some of the people downstream in the infection chain may be seriously harmed AND the epidemic is prolonged.