I'm interested in a (preferably analytic) solution or approximation to the following problem:
Let $s_1$ be a sample of size $N_1$ from an unknown distribution, with proportion of successes $p_1$. Let $s_2$ be an independent sample of size $N_2$ from the same distribution, with proportion of successes $p_2$. Given $N_1$, $p_1$, and $N_2$, can we calculate a confidence interval for $p_2$ (strictly speaking a prediction interval, since $p_2$ is a random quantity rather than a fixed parameter)?
I would love a general purpose analytic solution if anyone has one, but for simplicity I am fine with considering the case where both $s_1$ and $s_2$ satisfy the conditions for their sampling distributions to be approximated by a Gaussian distribution.
My attempts at solving this have led me to two options:
1. Find the upper and lower bounds of a confidence interval for $p$ (the population proportion of "successes"), and plug each bound back into a confidence interval for $p_2$ using the sampling distribution of a proportion with sample size $N_2$. Then take the min and max over those intervals. Or
2. Treat $p$ as a normally distributed random variable with $\mu=p_1$ and $\sigma=\sqrt{\frac{p_1(1-p_1)}{N_1}}$, which would imply the CDF for $p_2$ can be found by:
$CDF(x) = \int_0^1{\frac{1}{\sigma}\,NormPDF\left(\frac{y-p_1}{\sigma}\right)\cdot NormCDF\left(\frac{x-y}{\sqrt{\frac{y(1-y)}{N_2}}}\right)dy}$
where $NormPDF$ and $NormCDF$ are the PDF and CDF functions for the standard normal distribution.
The problem with option 1 is that the resulting interval is much wider than I would ideally want (it is what I am currently using in my equations). The problem with option 2 is that I have no idea how to convert the integral into an analytic function (presumably through an approximation involving $erf$, since I assume the integral has no closed-form solution). My goal is to graph these intervals as a function of $p_1$ in Desmos, along with other sampling/prediction strategies for comparison, which is why I would really like an analytic solution or approximation.
If someone can solve this or point me toward a solution, that would be greatly appreciated!