
Lombardy, a region of Italy, has registered many severe cases of COVID-19 in the last few months. Unfortunately, the available data don't allow us to estimate the percentage P of people who contracted the virus in Lombardy (please don't dispute this, but take it as an assumption). Now, suppose we have a medical test that says whether a person contracted the virus or not (EDIT: let's assume that the test always gives the correct answer). How many people should we test to estimate P with a sufficiently small error?

Data are:

  • 10 million people live in Lombardy.

  • P can be assumed to be at least 4%.

  • No assumption can be made that P is uniformly distributed, whether geographically or by age, gender, or anything else.

  • A relative error of 25% on P would be satisfactory.

A related question is: how could we confirm the estimated error on P once we have the data? For example, could we bootstrap the data?

Thanks

user7669
  • Thanks. I want to assume here that the test is 100% correct. – user7669 Mar 28 '20 at 16:49
  • I think the only relevant thing here is the error percent on P, which is desired to be 25%: If you test all inhabitants, you have an error of 0% (you are completely sure about the percentage of infected people); if you test 0 inhabitants, you have an error of 100% (you have no idea of the percentage of infected people). Thus, testing 7.5 million (75% of the population) gives you the desired accuracy of 25%. Right? – nukimov Mar 28 '20 at 17:26
  • @nukimov that’s incorrect. If that was true, people doing research on any topic would need to survey big fractions of the population to get meaningful results, they don’t. – Tim Mar 28 '20 at 18:38

1 Answer


This is actually a textbook example of determining the sample size needed for estimating a binomial proportion (see e.g. Jones et al., 2004, or Naing, 2003, for further references and examples).

First of all, to make it more precise, we are looking for a sample size such that, with probability $\alpha$, the difference between the true probability of being infected $p$ and its estimate $\hat p$ is not greater than $(100\times\delta)\%$ of $p$:

$$ \Pr(|p - \hat p| \le \delta p) = \alpha $$

Given that the target population is large, we would usually use the binomial distribution to represent it, i.e. we say that the population is large enough that the chance of randomly sampling someone more than once is negligible. The distribution is parametrized by the probability of "success" (here, the probability of being infected) $p$ and the number of samples we draw $n$. Let's denote the observed number of infected people by $k$; then $\hat p = k/n$ is the fraction of infected people in the sample, and we treat it as an estimate of the fraction of infected people in the whole population. If we wanted to calculate a confidence interval for $\hat p$, we could use the normal approximation

$$ \hat p \pm z_\alpha \sqrt{\frac{\hat p(1-\hat p)}{n}} $$

where $z_\alpha$ is the quantile of the standard normal distribution such that, for $z$ drawn from the standard normal distribution, we have $\Pr(-z_\alpha < z < z_\alpha) = \alpha$. You are saying that you'd like this interval to be equal to $\hat p \pm \delta p$. As discussed in the linked resources, you can solve this, so that for a given $p$, precision $\delta$, and certainty $\alpha$, you can guesstimate the sample size needed:

$$ n \approx \Big(\frac{z_\alpha}{\delta p}\Big)^2 \; p(1-p) $$
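This follows from equating the half-width of the normal-approximation interval above to the desired margin $\delta p$ and solving for $n$:

$$ z_\alpha \sqrt{\frac{\,p(1-p)\,}{n}} = \delta p \quad\Rightarrow\quad n = \frac{z_\alpha^2 \, p(1-p)}{(\delta p)^2} $$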

Assuming a $(100 \times \alpha)\% = 99\%$ confidence level, we can plot this for different values of $p$ to find out that for $100 \times p \ge 4\%$ the needed sample sizes are generally not much larger than $2000$ samples.

[Figure: required sample size $n$ as a function of $p$, for $\alpha = 0.99$ and $\delta = 0.25$]

For example, for $p=0.04$ ($4\%$ infected) this yields:

> # normal quantile (one-sided) used as z_alpha
> z <- function(alpha) qnorm(alpha)
> # approximate sample size for relative precision delta at confidence alpha
> n <- function(p, alpha=0.99, delta=0.25) (z(alpha)/(p*delta))^2 * p*(1-p)
> n(0.04)
[1] 2078.167
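If you'd like to reproduce a plot like the one above, here is a minimal sketch using the same n function (the range of $p$ values is an assumption, since the original figure's range isn't given):

> # assumed grid of infection rates; the original plot's exact range isn't stated
> p_seq <- seq(0.04, 0.5, by = 0.005)
> plot(p_seq, n(p_seq), type = "l", xlab = "p", ylab = "required sample size n")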

To convince yourself, you can verify this by simulation: draw $n$ samples from a binomial distribution with probability of infection $p$, repeat this procedure $R$ times, and then check how often your result was not further than $(100 \times \delta) \%$ from the true value:

> set.seed(123)
> # fraction of nsim simulated studies whose estimate k/n falls within delta*p of the true p
> sim <- function(p, n, delta, nsim=50000) mean(abs(p - rbinom(nsim, n, p)/n) / p <= delta)
> sim(0.04, 2078, 0.25)
[1] 0.97858

So we wanted to be $99\%$ sure, and with the sample size given by the approximation, the simulated result was within the interval in $97.8\%$ of cases. Not bad.
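As a side note, part of the gap comes from the fact that qnorm(alpha) is the one-sided quantile; the two-sided quantile matching the definition of $z_\alpha$ above would be qnorm((1 + alpha)/2), which gives a somewhat larger $n$ and should bring the simulated coverage closer to the nominal $99\%$. A sketch of that variant:

> # two-sided quantile: Pr(-z < Z < z) = alpha
> z2 <- function(alpha) qnorm((1 + alpha)/2)
> n2 <- function(p, alpha=0.99, delta=0.25) (z2(alpha)/(p*delta))^2 * p*(1-p)
> n2(0.04)
[1] 2547.8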

Notice that this is just a simple approximation for the calculation, assuming simple random sampling. With the whole population locked in their homes, sampling individuals at random may not be as hard as in most usual surveys. On the other hand, things may not go as smoothly as planned, or you may want to use other sampling schemes to have a higher chance of the sample being representative, which would make the calculation more complicated. Moreover, the tests used aren't perfect and give false results, as described, for example, by the New York Times or the Washington Post, and you'd need to account for that as well. Also remember that there have been many examples where such seemingly simple problems turned out to be more complicated than expected; e.g. polls of Trump's support before the election were very wrong, even though they used state-of-the-art survey methodology.
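For illustration, here is a minimal sketch of one standard way to account for an imperfect test (the Rogan–Gladen correction), assuming the test's sensitivity and specificity are known; the numbers used below are made up:

> # Rogan-Gladen correction of the raw positive rate for known
> # sensitivity (se) and specificity (sp) of the test
> adjust <- function(p_raw, se, sp) (p_raw + sp - 1) / (se + sp - 1)
> adjust(0.05, se = 0.95, sp = 0.98)  # hypothetical test characteristics
[1] 0.03225806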

Tim
  • Thanks. I think that I was able to follow. Maybe in the confidence interval we could use the true variance instead of the estimated one. I don't know much about the available medical tests that could be used. These should not only detect the virus in sick persons, but also the antibodies in recovered persons. – user7669 Mar 29 '20 at 20:29
  • @user7669 for the binomial distribution, when the mean (probability of "success") is $p$, the variance is $p(1-p)$. – Tim Mar 29 '20 at 20:31
  • Yes. This was what I was thinking, because $\hat{p}$ was used instead of $p$ in the formula for the confidence interval under the normal approximation. – user7669 Mar 30 '20 at 06:00
  • @user7669 True mean or variance is usually not known and if you knew it, you wouldn't need to estimate it and wouldn't care about confidence interval around the prediction. – Tim Mar 30 '20 at 06:37
  • I see. Thanks again. – user7669 Mar 30 '20 at 19:30
  • Thanks for this answer, very helpful. Can you put the meaning of delta into words? The OP calls it the error on p. But it seems to behave differently than an error. Holding p constant, the larger delta is the fewer tests are needed. I'm thinking it's a measure of the belief in p. delta = 1.0, you believe strongly in the claimed value of p. delta = 0.1, one has little confidence. Is this right? What's the word or phrase that best describes delta? – Bryan Hanson May 04 '20 at 01:48
  • @BryanHanson it was described in words, I’m not sure what you mean? “Confidence” sounds like confidence interval, which it isn’t. “Degree of belief” sounds Bayesian, it isn’t either. Simply: the more precise you need to be, the more data you need. – Tim May 04 '20 at 05:06
  • Thank you @tim I'm asking because I would like to explain this to a lay person. The mathematical effect of delta is clear. Conceptually, would you say that delta is a surrogate for the quality of the diagnostic tests used to measure p? High delta, the tests are reliable, lower delta, not so reliable? If true, delta would be standing in *conceptually* for sensitivity and specificity measures, but not mathematically directly linked. – Bryan Hanson May 04 '20 at 12:22
  • @BryanHanson it has nothing to do with that, it is about how certain you *want* to be about the result. It is unrelated to test precision. – Tim May 04 '20 at 12:46
  • @tim Isn't alpha, the confidence interval, how certain one wants to be? Thanks for your patience. – Bryan Hanson May 04 '20 at 12:55
  • @BryanHanson alpha is how certain you want to be and delta the margin of error – Tim May 04 '20 at 13:17
  • I'm curious how do you deal with the bias inherent in people accepting/rejecting the invitation to participate in the testing study? Any links to papers/posts describing how they dealt with this would be appreciated. – user12344567 May 07 '20 at 18:53