
I am French and the restrictions in place in my country are driven by several indicators, including "the total number of infections per day". For reference, it is currently given as about 10000.

My general question is: how statistically relevant is this number? (the specific question is further below)

It is based on the number of tests, so if a country did 0 tests per day, this number of infections would be 0. If every citizen were tested daily, we would have the exact number of infected people every day (I put aside the 24-48 hour delay to get the results, which is not relevant to my question; let's assume we have the results immediately).

France does a certain number of tests per day, probably several hundred thousand, but far fewer than the roughly 70 million citizens of our country.

We could imagine that if $n$ people out of $m$ tested are positive (a number that is not provided - it probably exists somewhere but it is not the indicator), one could extrapolate that we have $n/m \times 70000000$ infected people every day.

That would in turn assume that the people who get tested are representative of the whole population, which is certainly not the case.
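To make the naive extrapolation above concrete, here is a minimal sketch. The counts `n_positive` and `m_tested` are hypothetical (as said, I do not have the real ones), and the function name is mine; the result is only meaningful if the tested people were a random sample, which is exactly the assumption in doubt.

```python
POPULATION = 70_000_000  # rough figure used in this question


def naive_extrapolation(n_positive, m_tested, population=POPULATION):
    """Scale the positivity rate n/m up to the whole population.

    Only valid if the tested people are a random sample of the population.
    """
    if m_tested <= 0:
        raise ValueError("at least one test is needed to extrapolate")
    return n_positive / m_tested * population


# Hypothetical daily counts, chosen for illustration only.
estimate = naive_extrapolation(n_positive=10_000, m_tested=300_000)
print(f"naive estimate: {estimate:,.0f} infected")
```

With zero tests the function refuses to answer, which mirrors the point that the raw count says nothing without knowing how many tests were done.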

My specific question: are there models that allow the extrapolation from the non-representative tested population (based, say, on a set of questions about them, their age, health, occupation, etc.) to estimate the true number of ~10000 infected people every day?

In other words: does this number make sense?


I would like, just in case, to mention that I do not want to discuss whether using that number is relevant, or whether that number means something biologically speaking. I am just trying to understand whether it can, statistically, reflect reality.

WoJ
  • Can't say for France. In CA daily numbers of cases are said to be based on a combination of hospital admissions for Covid-19, and positive tests reported by nursing homes, clinics, and various local government agencies. // In a statistical sense, reports based on tests depend on the accuracy of tests, some with unknown false positive rates. // In a practical sense, remaining space in hospitals, especially in their intensive care units, determines the level of care that can be provided over the short term. So it is reasonable to use such reliable data as a guide for government restrictions. – BruceET Dec 09 '20 at 23:31

1 Answer


From the standpoint of sampling theory, I agree that getting tested raises a self-selection issue. Hence, statistics derived from such data can describe the parent population, but usually without explicit statistical means to assert margins of error.

However, all is not lost: there are more advanced ways to use the data together with associated covariates to construct a subset that behaves like a random sample. This is based on an exchangeability principle: basically, one employs a model (like a logit classification scheme) to assign self-selected volunteers to a class that, via the covariates, is in accord with the classification of the general population.

The paper "Sampling bias and logistic models" may outline the principles; it can be found in J. R. Statist. Soc. B (2008), 70, Part 4, pp. 643–677. A related work is "Propensity score adjustment using machine learning classification algorithms to control selection bias in online surveys".
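As a rough illustration of the reweighting idea, here is a minimal sketch using plain post-stratification, which is a simpler relative of the logit/propensity approach described above and not the method of either paper. It assumes the covariates reduce to a single stratum label and that the population share of each stratum is known; the strata, counts, and function name are all hypothetical.

```python
def post_stratified_rate(sample, population_shares):
    """Reweight per-stratum positivity rates by known population shares.

    sample: dict mapping stratum -> (n_positive, m_tested)
    population_shares: dict mapping stratum -> share of the population
    """
    if abs(sum(population_shares.values()) - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    rate = 0.0
    for stratum, share in population_shares.items():
        n_positive, m_tested = sample[stratum]
        if m_tested == 0:
            raise ValueError(f"no tests in stratum {stratum!r}")
        rate += share * n_positive / m_tested
    return rate


# Hypothetical data: the "old" stratum is over-represented among the tested.
sample = {"young": (50, 1000), "old": (300, 2000)}
population_shares = {"young": 0.8, "old": 0.2}

# Pooled rate, ignoring who was tested.
naive_rate = sum(n for n, _ in sample.values()) / sum(m for _, m in sample.values())
# Rate after reweighting each stratum to its population share.
adjusted_rate = post_stratified_rate(sample, population_shares)
print(f"naive: {naive_rate:.3f}, post-stratified: {adjusted_rate:.3f}")
```

The adjustment only removes the bias explained by the recorded covariates; if people within a stratum still self-select on symptoms, the reweighted rate remains biased.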

AJKOER
  • This is a good answer, but the problem in relation to covid-19 PCR tests is that no such auxiliary data exists. So the majority of the data is junk. Proper sampling and recording of information about the spread is needed to improve the situation. – Sextus Empiricus Dec 14 '20 at 07:22