Estimating the subset of a population which are successful when the population itself is an estimate

Question

Let's say I have a population $S$ of unknown size $n$, with an estimate $\hat{n}$ (and standard error $\sigma_{\hat{n}}$). $\hat{n}$ is obtained by drawing random samples from a larger sample space of size $m$ ($n \ll m$) and counting how many belong to $S$. For our purposes, $n$ can't realistically be determined any other way. Each draw is a Bernoulli trial (a sample either belongs to $S$ or doesn't), and we calculate $\sigma_{\hat{n}}$ through a normal approximation.

I'd like to sample from $S$ and determine how many samples belong to the successful subset $T$, on the basis of some arbitrary criterion for $s \in S$. Let the observed proportion of $S$ which is in $T$ be called $\hat{p}$, and let's say we also use a normal approximation for its standard error $\sigma_{\hat{p}}$. My question is: how does $\sigma_{\hat{n}}$ "interact" with $\sigma_{\hat{p}}$, given that we want to calculate $\hat{n}\hat{p}$?

Some notes:

$\hat{n}$ and $\hat{p}$ are independent. There's no relationship between the two.
Let's say we're initially sampling from $R$ (of known size $m$) to find $\hat{n}$. Why not instead determine directly how many $r \in R$ are in $T$? The reason is that verifying whether some $r$ or $s$ is in $T$ is very complex (PSPACE-hard). The maximum number of samples I can realistically verify to be in $T$ is so small that $m\hat{q}$ (where $\hat{q}$ is the observed proportion of $R$ in $T$) would have confidence intervals much too large to mean anything useful. So instead, I can get a much more precise estimate $\hat{n}$, and then sample from $S$.

Any guidance appreciated.

Potential answer: propagation of normally distributed errors, in our case for the product of two estimates with standard errors $\sigma_1$ and $\sigma_2$: notes
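
As a rough illustration of that propagation, here is a minimal Python sketch. It assumes $\hat{n}$ and $\hat{p}$ are independent, in which case the variance of the product is $\hat{n}^2\sigma_{\hat{p}}^2 + \hat{p}^2\sigma_{\hat{n}}^2 + \sigma_{\hat{n}}^2\sigma_{\hat{p}}^2$ with the estimates plugged in for the true means; the first-order (delta-method) version drops the last term. The function name `product_se` and all the numbers are hypothetical, chosen only to show the arithmetic.

```python
import math


def product_se(n_hat, se_n, p_hat, se_p, first_order=False):
    """Standard error of n_hat * p_hat, assuming the two estimates are independent.

    Plugs the estimates in for the unknown means:
        Var(n*p) ~= n^2 * se_p^2 + p^2 * se_n^2 + se_n^2 * se_p^2
    With first_order=True the last (usually tiny) cross term is dropped.
    """
    var = (n_hat * se_p) ** 2 + (p_hat * se_n) ** 2
    if not first_order:
        var += (se_n * se_p) ** 2
    return math.sqrt(var)


# Hypothetical inputs, not taken from the question.
n_hat, se_n = 50_000.0, 1_000.0   # estimated size of S and its standard error
p_hat, se_p = 0.02, 0.004         # estimated fraction of S in T and its standard error

estimate = n_hat * p_hat
se = product_se(n_hat, se_n, p_hat, se_p)

# Normal-approximation 95% interval for the size of T.
lo, hi = estimate - 1.96 * se, estimate + 1.96 * se
print(f"estimate = {estimate:.0f}, se = {se:.1f}, 95% CI = ({lo:.0f}, {hi:.0f})")
```

With these made-up inputs the standard error comes out at roughly 201 either way, because the cross term is negligible when both relative errors are small. The normal interval for the product is itself only an approximation.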

Other comments: I initially asked some pretty incomprehensible questions, and really shouldn't have been given the time. Thanks for everyone's precious time, especially BruceET's and whuber's.

Your question is a bit vague; are you sampling to estimate a proportion, the mean of some quantity, or the population size? A very accessible place to start learning about sampling is the Penn State Stat Online Course [STAT506, Sampling Theory and Methods](https://online.stat.psu.edu/stat506/). Good luck! — Mike Anderson, Dec 27 '20 at 19:30
Why don't you know the actual sample size? What are you trying to find out? What do you mean by 'successful'? — BruceET, Dec 27 '20 at 21:13
Hi Bruce, can you see the updated question? I don't know the sample size because it's not something I can easily calculate. I'm trying to find out the size of the subset of the population which is successful. And by successful, I mean that a member of the population passes the "success criteria". — Colin McDonagh, Dec 27 '20 at 21:27
I guess this question will remain closed, but I guess in essence what I was asking is how do we multiply two confidence intervals: https://stats.stackexchange.com/questions/305382/how-do-i-calculate-the-confidence-interval-for-the-product-of-two-numbers-with — Colin McDonagh, Dec 27 '20 at 22:09
That last comment helped me understand what you are trying to ask, so I would like to suggest that you consider editing the question to include a similar remark. It would help even more to provide more information about how $n$ and $\sigma_{\hat n}$ are estimated as well as about how you are able to obtain samples. Abstractly it's a strange situation and the description at least suggests the possibility that $\hat n$ and $\hat p$ are not independent, which may be an important consideration. — whuber, Dec 28 '20 at 13:45
What I understand is the following. You have a population with two properties: its size $n$ and the fraction of successes in it, $p$. Your question is how to describe an estimate of the number of successes in the population, $pn$. What you have is an estimate $\hat{n}$ with some deviation (standard error?) $\sigma_{\hat{n}}$, and an estimate $\hat{p}$ based on a sample from the population. — Sextus Empiricus, Dec 28 '20 at 21:16
It seems you can approach this as the [product of two variables](https://stats.stackexchange.com/questions/15978/), for which you can express the error of the product based on the errors of the individual terms. — Sextus Empiricus, Dec 28 '20 at 21:19
Thanks for your time guys. My initial posts were totally incomprehensible which is unfair on you who give of your precious time freely. It still might be incomprehensible though, so I hold out. I've updated the question in response to the last three comments. Yeah Sextus, except that the variables are independent. — Colin McDonagh, Dec 28 '20 at 22:55

score 2 · Accepted Answer · answered Dec 28 '20 at 23:10


An alternative approach is to sample from the large population $R$ of size $m > n$ until you have some fixed number of successes (samples in $T$).

The sampling is done by first testing whether a sample is in $S$, and only if it is do you test whether it is in $T$ (a success). That way you do not need to run the costly $T$ test on every sample.

The number of samples that you need follows a negative binomial distribution. Based on that you can estimate the probability $\hat{p}$ that a sample from $R$ is in both $S$ and $T$ (note that this is a fraction of $R$, not the fraction of $S$ used in the question), and $\hat{p}m$ will be the estimate for the size of $T$.
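
The procedure above can be sketched as a small Python simulation. Everything here is a toy assumption: the population is just the integers `0..m-1`, membership in $S$ and $T$ is a simple threshold, and `inverse_sample` is a hypothetical helper name. The answer does not say which estimator to use; the sketch uses $(k-1)/(N-1)$, the usual unbiased estimator of a proportion when sampling stops at the $k$-th success after $N$ draws.

```python
import random


def inverse_sample(draw, in_S, in_T, k):
    """Draw from R until k members of T have been found.

    The cheap in_S test runs on every draw; the costly in_T test
    runs only on draws that passed in_S.
    Returns (total draws, number of costly in_T checks).
    """
    trials = costly_checks = successes = 0
    while successes < k:
        r = draw()
        trials += 1
        if not in_S(r):
            continue              # rejected cheaply, no costly check needed
        costly_checks += 1
        if in_T(r):
            successes += 1
    return trials, costly_checks


# Toy population (assumed for illustration): R = {0, ..., m-1},
# S = the first n_true integers, T = the first t_true integers.
rng = random.Random(1)
m, n_true, t_true = 100_000, 5_000, 1_000
k = 200                           # stop after this many successes

trials, costly_checks = inverse_sample(
    draw=lambda: rng.randrange(m),    # sampling with replacement
    in_S=lambda r: r < n_true,
    in_T=lambda r: r < t_true,
    k=k,
)

p_hat = (k - 1) / (trials - 1)    # estimated fraction of R that is in T
size_hat = m * p_hat              # estimated size of T

print(f"draws = {trials}, costly checks = {costly_checks}, size estimate = {size_hat:.0f}")
```

In this toy setup only about one draw in twenty reaches the costly check, which is the point of testing $S$ first. The relative error of the size estimate is on the order of $1/\sqrt{k}$.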

Sextus Empiricus


Ok interesting, thanks Sextus. Would you mind checking my logic here if you have time? I agree that checking $r \in S$ before $r \in T$ makes sense. In my case, I expect $\hat{p} \approx 10^{-3}$. The standard error of a binomial proportion using a normal approximation is $(\frac{p(1-p)}{N})^{1/2} \approx (\frac{10^{-3}}{N})^{1/2}$. I'd like to have $\sigma_{\hat{p}} \leq \frac{\hat{p}}{10}$, which means that approximately I must have $(\frac{10^{-3}}{N})^{1/2} \leq 10^{-4}$ (ignoring the 1.96 multiplier in the case of a 95% CI). – Colin McDonagh Dec 29 '20 at 00:20
In which case $N \geq 10^5$, but I think the greatest $N$ I can have is $10^3$. Maybe I'll have to have a think about how I can reduce the size of $m$, thus increasing $\hat{p}$. But then again, maybe the error propagation approach would be easier – Colin McDonagh Dec 29 '20 at 00:20
Ah, sorry. If I can rule out most samples on the basis that $r \notin S$, such that the probability of $r \in T$ given $r \in S$ is approximately $1$, then I only need to do the harder verification of $r \in T$ for $\frac{10^5}{10^3}$... which is definitely possible :) – Colin McDonagh Dec 29 '20 at 00:26
Thank you Sextus! – Colin McDonagh Dec 29 '20 at 00:37
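
The sample-size arithmetic in the comments above can be checked with a short Python sketch. It assumes the normal-approximation standard error $\sqrt{p(1-p)/N}$ and a target expressed as a fraction of $p$; the function name `trials_for_relative_se` is made up for this example.

```python
import math


def trials_for_relative_se(p, rel_se):
    """Smallest N with sqrt(p * (1 - p) / N) <= rel_se * p.

    Rearranging the inequality gives N >= (1 - p) / (p * rel_se**2).
    """
    return math.ceil((1 - p) / (p * rel_se ** 2))


p = 1e-3        # expected fraction of R that is in T (value from the comment)
rel_se = 0.10   # target: standard error no larger than p / 10

N = trials_for_relative_se(p, rel_se)
expected_successes = N * p

print(f"draws needed: about {N:,}")
print(f"expected successes among them: about {expected_successes:.0f}")
```

This reproduces the figure of about $10^5$ draws. For small $p$ the relative standard error is close to $1/\sqrt{Np}$, so a 10% target corresponds to about 100 expected successes, however many cheap rejections it takes to find them.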

BruceET · Answer 2 · 2020-12-27T20:37:11.850

Suppose you are sampling college admission test scores from a large high school district. Traditionally, the district mean on this test has been 280 with a standard deviation of 25. Then an approximate 95% CI for this year's district mean would be of the form $\bar X \pm 2(25)/\sqrt{n}.$ If you want this year's CI to have a margin of error of $\pm 10,$ then you have $2(25)/\sqrt{n} = 50/\sqrt{n} = 10.$ So you need a sample size of $n \approx 25.$

Suppose you sample $n = 25$ observations at random from $\mathsf{Norm}(\mu=280, \sigma=25),$ to get data x as below. (Sampling and computations in R.)

    set.seed(2020)
    x = rnorm(25, 289, 25)
    summary(x);  length(x);  sd(x)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      234.9   267.9   282.3   284.0   299.3   326.0 
    [1] 25         # sample size
    [1] 22.52312   # sample SD

The `t.test` procedure in R provides a 95% CI $(274.7, 293.3)$ as part of its output, captured below using `$`-notation. The margin of error for this sample is about 9.3. [Margins of error will vary from sample to sample, depending on the sample standard deviation: for example, four additional samples of size $n=25$ gave margins of error 9.7, 12.9, 8.9, and 10.7.]

    t.test(x)$conf.int
    [1] 274.6643 293.2585
    attr(,"conf.level")
    [1] 0.95
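
As a quick check on where the margin of error of about 9.3 comes from: it is the half-width of the $t$ interval. Using the sample values printed above and the standard table value $t_{0.975,\,24} \approx 2.064$ (an assumption in the sense that it is not shown in the R output),

$$E = t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \approx 2.064 \times \frac{22.523}{\sqrt{25}} \approx 9.30,$$

which agrees with half the width of the reported interval, $(293.2585 - 274.6643)/2 \approx 9.30.$ The multiplier is slightly larger than the 2 used in the planning calculation because the standard deviation is estimated from the sample.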

Hi Bruce, apologies but I've updated the question. I didn't have the question clear in my head initially — Colin McDonagh, Dec 27 '20 at 21:04
Sorry. Updated version does not appear. Maybe someone else will give this a try. — BruceET, Dec 27 '20 at 21:09
