
We have a question that is quite puzzling but that others will surely have come across before (e.g., in epidemiology). We are considering two methods for obtaining a prevalence estimate. The question is: which one fits our aims, and are the two methods equivalent when it comes to estimating their standard error or CI?

We aim to do the following:

  1. Get an accurate estimate of the prevalence of ‘x’ in the population, where each member of the population either is x or is not x
  2. Test this against a given percentage (e.g., to see whether it is greater than 1% with a binom.test [in method M below, how would that work?])
  3. Do some kind of power analysis to see whether the method (and the sample sizes we can feasibly manage) is informative enough for aims 1 and 2

Method M

We would start with a population with a known number of members: 371,949. We then screen this population and count all members that meet condition X: 13,077 (i.e., 3.5% of the population). Next, we randomly sample from the members with X, say a sample of 1,000, for which we do a (time-consuming) manual check to verify whether the members actually fit the condition we want to estimate; call that ‘actual x’. If in this last step we find 50% (500 out of 1,000) to be ‘actual x’, we generalise back to the population, saying (0.5 × 3.5% =) 1.75% is our prevalence estimate of ‘actual x’ in the population.
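
For concreteness, the arithmetic of method M as a short R snippet (just the point estimate, using the example numbers above):

```r
# Method M point estimate, using the example numbers above
N      <- 371949          # population size
n_X    <- 13077           # members meeting condition X (known exactly)
n_samp <- 1000            # random sample drawn from the members with X
n_true <- 500             # of those, manually verified as 'actual x'

ppv  <- n_true / n_samp   # 0.50: share of checked members that are 'actual x'
prev <- ppv * (n_X / N)   # ~0.0176: prevalence estimate of 'actual x'
prev
```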

The puzzling thing is that our estimate is based on n = 1,000 sampled from the members with condition X. What happens to the SE when you multiply that estimate by a constant to say something about the whole population (where there is no uncertainty: we know exactly how many members meet condition X)? In other words, how would you calculate the SE/CI for method M, and does it match the second method we describe below?

Method A

We would again start with a known population of 371,949 members, draw a random sample of, say, 30,000 from it, and screen that sample for condition X, which would give us ~1,050 members. Those ~1,050 we would manually code (the time-consuming step); if we find 450 to be ‘actual x’, we would generalise back to the population, saying (450 is 1.5% of 30,000, so) the prevalence in the population is 1.5%.
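
For method A we think the calculation is straightforward, since the manually coded cases come from a simple random sample of the population. A sketch in R (ignoring the modest finite population correction from sampling 30,000 of 371,949):

```r
# Method A estimate, CI, and test, using the example numbers above
n_samp <- 30000
n_true <- 450

n_true / n_samp                        # 0.015 -> 1.5% prevalence estimate
binom.test(n_true, n_samp)$conf.int    # two-sided 95% CI (aim 1)
binom.test(n_true, n_samp, p = 0.01,   # one-sided test against 1% (aim 2)
           alternative = "greater")$p.value
```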

Given the three aims stated above, is one method preferable? And if the two are equivalent, how would one calculate the SE/CI for method M?

Anything you can point us at is much appreciated! Thanks!

1 Answer


Interesting problem! I think this is a special kind of quantification problem. The main message to remember is that quantification learning is not a by-product of classification; it is a field of its own.

First, let's define some parameters. Let the first subscript of $X$ be the true, manually checked label and the second subscript of $X$ be the unchecked, estimated label from your dataset. Say you are interested in the prevalence of class $1$ in your total population, which has members from class $0$ and class $1$. Then $X_{11}$ denotes the number of members of your population that belong to your class of interest and have the correct label in your dataset. Furthermore, let $N$ be the size of the total population and let $\alpha$ be the prevalence. You are interested in the prevalence of class $1$ in the population, which can be written as $$ \alpha = \frac{X_{10} + X_{11}}{N}.$$ We do not know the values of $X_{10}$ and $X_{11}$, so we cannot calculate the prevalence directly. However, you proposed two sampling techniques which we can use to estimate $\alpha$.

From your description, I assume that members of your population estimated as class $0$ do not truly belong to class $1$. In other words, there are no false negatives, so $X_{10} = 0$, which makes your negative predictive value (NPV) equal to $1$. The NPV can be computed as: $$\text{NPV} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Negatives}} = \frac{X_{00}}{X_{00} + X_{10}} = 1.$$ Note that the negative predictive value is different from the true negative rate (TNR): $$\text{TNR} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} = \frac{X_{00}}{X_{00} + X_{01}}.$$
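
A toy numerical example of that distinction, with counts that are purely made up for illustration:

```r
# Hypothetical confusion-matrix counts: first index = true label,
# second index = estimated label (matching the notation above)
x_00 <- 900; x_01 <- 40; x_10 <- 10; x_11 <- 50

x_00 / (x_00 + x_10)  # NPV: share of estimated negatives truly negative
x_00 / (x_00 + x_01)  # TNR: share of true negatives estimated negative
```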

This assumption is an important one. Since your estimated prevalence is small $(\approx 1.75\%)$ and your group of estimated negatives is large $(\approx 96.5\%)$, a small violation of the assumption that there are no false negatives can lead to a large bias in your estimated prevalence. If only $1\%$ of the members estimated as class $0$ actually belong to class $1$, your estimated prevalence $\alpha$ increases from $1.75\%$ to approximately $2.70\%$, a relatively large difference! So my warning is to be aware of this problem; it can occur even with small percentages.
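
The arithmetic behind that warning, as a quick check in R:

```r
# If 1% of the screened negatives (96.5% of the population) are in fact
# 'actual x', they add 0.01 * 0.965 ~ 0.97 percentage points of prevalence
p_flag <- 0.035               # share flagged as class 1
p_est  <- 0.0175              # naive prevalence estimate (PPV = 50%)
p_est + 0.01 * (1 - p_flag)   # ~0.0272 -> ~2.7% instead of 1.75%
```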

Back to the methods: I would advise method M if you are sure you have no false negatives. Since the sample size is fixed and you know that exactly $3.5\%$ of your population is estimated as class $1$, you can estimate your positive predictive value $\left(\text{PPV} = \frac{X_{11}}{X_{11} + X_{01}}\right)$ and scale it by that known $3.5\%$. Because you sample the $1{,}000$ members without replacement from the finite group of $13{,}077$ flagged members, the variance of your PPV can be estimated with a hypergeometric distribution, and the confidence bounds can then be scaled by the same $3.5\%$.
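
As a sketch of that calculation in R, using the example numbers from the question (a normal approximation with a finite population correction stands in here for an exact hypergeometric interval):

```r
# Method M: SE and CI for the prevalence, via the PPV
N   <- 371949               # population size
n_X <- 13077                # flagged members (known exactly, no uncertainty)
n   <- 1000                 # manually checked subsample of the flagged
k   <- 500                  # verified 'actual x'

ppv    <- k / n                             # estimated PPV
fpc    <- (n_X - n) / (n_X - 1)             # correction for sampling
                                            # without replacement
se_ppv <- sqrt(ppv * (1 - ppv) / n * fpc)   # SE of the PPV

scale <- n_X / N            # the known flagged fraction (~3.5%)
est   <- ppv * scale        # prevalence estimate (~1.75%)
se    <- se_ppv * scale     # the known constant scales the SE as well
est + c(-1.96, 1.96) * se   # approximate 95% CI for the prevalence

# Aim 2: testing 'prevalence > 1%' is equivalent to testing the PPV
# against the rescaled threshold 0.01 / scale (here ignoring the FPC):
binom.test(k, n, p = 0.01 / scale, alternative = "greater")
```

This also resolves the puzzle in the question: since $n_X/N$ is a known constant $c$, we have $\operatorname{Var}(c \cdot \widehat{\text{PPV}}) = c^2 \operatorname{Var}(\widehat{\text{PPV}})$, so multiplying the estimate by the known $3.5\%$ simply scales the SE by the same factor and adds no uncertainty of its own.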

In the case of false negatives, I would advise applying method A and also checking the false negative rate along the way. If you want to compute the variance of your estimate, it gets a bit more complicated: this is a tougher quantification problem, and more information can be found in this paper. Good luck with your research!
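
As a rough illustration of what checking for false negatives could look like under method A, here is a hypothetical sketch: draw an extra manual-check subsample from the screened negatives, estimate the share that are really 'actual x' (i.e., $1 - \text{NPV}$), and add the implied misses to the estimate. The subsample size and miss count below are invented for illustration only.

```r
# Hypothetical correction of the method A estimate for false negatives
n_samp   <- 30000   # method A sample
n_pos    <- 1050    # flagged by the screen
n_true   <- 450     # flagged and verified 'actual x'

n_negchk <- 500     # assumed extra manual checks among screened negatives
k_miss   <- 3       # assumed number found to be 'actual x' after all

miss_rate <- k_miss / n_negchk              # estimated 1 - NPV
missed    <- miss_rate * (n_samp - n_pos)   # implied misses in full sample
(n_true + missed) / n_samp                  # corrected prevalence estimate
```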

Karolis Koncevičius