We have a question that is quite puzzling but that others have surely come across (e.g., in epidemiology). We are considering two methods for obtaining a prevalence estimate. The question is which one fits our aims, and whether the methods are similar when it comes to estimating the standard error or CI.
We aim to do the following:
- Get an accurate estimate of the prevalence of ‘x’ in the population, where a member of the population can either be x or not be x
- Test this against a given percentage (e.g., to see if it’s greater than 1% with a binom.test [in method M below, how would that work?])
- Do some kind of power analysis to see if the method (with the sample sizes we chose based on feasibility) is informative enough for aims 1 and 2
Method M
We would start with a population with a known number of members: 371,949. We then screen this population to count all members that meet condition X, namely 13,077 (so 3.5% of the population are members with X). Next, we randomly sample from the members with X, say a sample of 1,000, for which we do a (time-consuming) manual check to verify whether members actually fit the conditions we want the estimate for. Let’s call that ‘actual x’. If in this last step we determine, say, 50% (500 out of 1,000) to be ‘actual x’, we then generalise back to the population, saying that (0.5 × 3.5% =) 1.75% is our prevalence estimate of ‘actual x’ in the population.
The puzzling thing here is that our estimate is based on n = 1,000 from the subgroup that has condition X. What happens to the SD when you multiply the estimate to say something about the whole population? (For the population we simply have the counts, so there is no uncertainty about how many members meet condition X.) In other words, how would you calculate the SE/CI for Method M, and does it match that of the second method we describe below?
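To make the question concrete, here is our tentative sketch of how the SE/CI for Method M might be computed. It is not a confirmed answer. It assumes that members not flagged by the screen are never ‘actual x’ (so the screened fraction acts as a known constant), that a normal approximation is acceptable, and that a finite population correction applies because the 1,000 are drawn without replacement from the 13,077. The variable names are our own.

```python
import math

N = 371_949      # population size (from the post)
N_X = 13_077     # members flagged by the screen for condition X
n = 1_000        # flagged members that are manually checked
k = 500          # of those, confirmed as 'actual x' (illustrative figure)

# Assumption: members NOT flagged by the screen are never 'actual x'.
w = N_X / N              # known fraction flagged; carries no sampling error
q_hat = k / n            # estimated P(actual x | flagged)
prev_hat = w * q_hat     # prevalence estimate for the whole population

# Finite population correction for sampling without replacement from N_X.
fpc = (N_X - n) / (N_X - 1)
se_q = math.sqrt(q_hat * (1 - q_hat) / n * fpc)

# Multiplying an estimate by a known constant scales its SE by that constant.
se_prev = w * se_q

z = 1.96                 # approximate 95% normal quantile
ci = (prev_hat - z * se_prev, prev_hat + z * se_prev)

# Under the assumption above, "prevalence > 1%" is equivalent to
# "q > 0.01 / w" within the flagged group, so the test can be run there.
q0 = 0.01 / w

print(f"prevalence = {prev_hat:.4%}, SE = {se_prev:.4%}")
print(f"95% CI = ({ci[0]:.4%}, {ci[1]:.4%}), null value for q = {q0:.4f}")
```

If this reasoning holds, the prevalence estimate is about 1.76% with an SE of roughly 0.05 percentage points, and a test against 1% could be run on the 1,000 checked members with the null proportion `q0` (about 0.284) instead of 0.01.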
Method A
We would again start with a known population of 371,949 members. We then draw a random sample of, say, 30,000 from it and screen that sample for condition X, which would give us ~1,050. Those ~1,050 we would manually code (the time-consuming step). If we find, say, 450 to be ‘actual x’, we would generalise back to the population, saying that (450 is 1.5% of 30,000, so) the prevalence in the population is 1.5%.
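For Method A, our understanding is that the 30,000 can be treated as one simple random sample in which each member either is or is not ‘actual x’. A minimal sketch under that view, again assuming that members who fail the screen are never ‘actual x’ and using a normal approximation with a finite population correction (the names are our own):

```python
import math

N = 371_949   # population size (from the post)
n = 30_000    # simple random sample drawn from the whole population
k = 450       # sampled members confirmed as 'actual x' (illustrative figure)

# Assumption: sampled members who fail the screen count as "not actual x",
# so k out of n is a plain sample proportion.
p_hat = k / n

# Finite population correction for sampling without replacement from N.
fpc = (N - n) / (N - 1)
se = math.sqrt(p_hat * (1 - p_hat) / n * fpc)

z = 1.96      # approximate 95% normal quantile
ci = (p_hat - z * se, p_hat + z * se)

print(f"prevalence = {p_hat:.4%}, SE = {se:.4%}")
print(f"95% CI = ({ci[0]:.4%}, {ci[1]:.4%})")
```

Here the SE comes out at roughly 0.07 percentage points. Unlike in Method M, the number of screened-in members that land in the sample is itself random, and that extra randomness is part of this SE.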
For the three aims stated above, is there a better method to choose, and if they are the same, how would one calculate the SE/CI for the M method?
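For the power question (aim 3), this is the kind of calculation we have in mind: a normal-approximation power for a one-sided test of the proportion against the 1% threshold, applied to both designs. The true prevalence of 1.2% is a purely hypothetical scenario, `power_one_sided` is our own helper, and the same assumption is made that members who fail the screen are never ‘actual x’.

```python
import math
from statistics import NormalDist


def power_one_sided(p0, p1, n, alpha=0.05, fpc=1.0):
    """Approximate power of a one-sided z-test of H0: p <= p0
    when the true proportion is p1 (normal approximation)."""
    nd = NormalDist()
    se0 = math.sqrt(p0 * (1 - p0) / n * fpc)   # SE under the null
    se1 = math.sqrt(p1 * (1 - p1) / n * fpc)   # SE under the alternative
    crit = p0 + nd.inv_cdf(1 - alpha) * se0    # reject when p_hat > crit
    return 1 - nd.cdf((crit - p1) / se1)


N, N_X = 371_949, 13_077
w = N_X / N            # known screened-in fraction
threshold = 0.01       # prevalence to test against (from the post)
true_prev = 0.012      # hypothetical true prevalence, for illustration only

# Method M: test within the flagged group, with proportions rescaled by 1 / w.
n_m = 1_000
power_m = power_one_sided(threshold / w, true_prev / w, n_m,
                          fpc=(N_X - n_m) / (N_X - 1))

# Method A: test on the full random sample.
n_a = 30_000
power_a = power_one_sided(threshold, true_prev, n_a,
                          fpc=(N - n_a) / (N - 1))

print(f"Method M power: {power_m:.3f}")
print(f"Method A power: {power_a:.3f}")
```

Under this hypothetical scenario both designs would have power well above 0.9. Changing `true_prev` and the sample sizes would show how close to 1% the true prevalence can be before either design stops being informative.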
Anything you can point us at is much appreciated! Thanks!