7

I've got a stochastic variable $X$ which takes the values in $\{0,1,2\}$ with some unknown probabilities. I want to know the distribution, and can sample $X$ as many times as I want to. How many times will I need to sample to get some specific confidence interval?

(For instance, 99% sure that the estimated probabilities lie within 0.01 of the true probabilities.)

Anna
  • I think this is more related to the sample size required to obtain a certain error in the estimation of the parameters of a [multinomial distribution](http://en.wikipedia.org/wiki/Multinomial_distribution). Take a look at this [link](http://www.math.wsu.edu/faculty/genz/papers/mvnsing/node8.html). What you can do is narrow the C.I. by increasing the sample size. – May 23 '12 at 13:16
  • For an example of computing standard errors of estimates for linear functions of a trinomially distributed variable, please see http://stats.stackexchange.com/q/18603. Although the reply there does not fully answer your question, it exemplifies techniques useful for obtaining an answer. – whuber May 23 '12 at 13:46

2 Answers

5

If you have a Bernoulli variable (0 or 1) with probability $p$, its variance is $p(1-p)$, which is always at most $1/4$. The mean of $n$ independent Bernoulli variables is approximately Gaussian with variance $p(1-p)/n$, which is therefore at most $1/(4n)$. A Gaussian variable has a 99% chance of lying within plus or minus 2.58 standard deviations of its mean, so you need $2.58/(2\sqrt{n}) \leq 0.01$, which gives $n \geq 129^2 = 16641$.

Because each of your three outcomes individually behaves as a Bernoulli variable, and because this is a global upper bound on the variance, you can also apply this number to your discrete variable with three outcomes.
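A quick Monte Carlo check of this bound (a sketch only: the true probability vector below is an arbitrary assumption, since in practice it is unknown):

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.2, 0.3, 0.5])  # hypothetical true distribution (unknown in practice)
n, trials = 16641, 2000

hits = 0
for _ in range(trials):
    counts = rng.multinomial(n, p_true)  # n draws of X, tallied per outcome
    p_hat = counts / n                   # estimated probabilities
    if np.all(np.abs(p_hat - p_true) <= 0.01):
        hits += 1

print(f"fraction of runs with every |p_hat_i - p_i| <= 0.01: {hits / trials:.3f}")
```

The observed fraction should comfortably exceed 0.99, since the $1/4$ variance bound is conservative unless some $p_i$ is near $1/2$.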

gui11aume
  • Does this work even though the three outcomes are mutually exclusive, i.e. not independent Bernoulli variables? – Anna May 23 '12 at 13:02
  • Yes because we took the upper bound of the variance in the first place. – gui11aume May 23 '12 at 13:08
  • +1. This certainly works but can you think of a way to give _sharp_ confidence intervals for each proportion? That would require accounting for the dependencies between the two intervals. – Macro May 23 '12 at 13:16
  • Thanks! I made heavy use of the "and can sample X as many times as I want to", so this solution is far from optimal, as you noticed. Getting optimal (sharper) confidence regions is a bit more involved, I think. You could use convergence to a 3D Gaussian with correlation terms that depend on (p,q,r) and find a region of that space that has 99% probability, which again depends on (p,q,r). Possible, but the answer will depend on (p,q,r). – gui11aume May 23 '12 at 13:23
  • (+1) I made a minor correction, hope you don't mind. – jbowman May 23 '12 at 18:43
  • @gui11aume, getting sharp confidence intervals is possible as long as you don't make an approximation to a Gaussian (see my answer). The problem with approximating as a Gaussian is that the approximation is only good when $n$ is very large, but that's one of the things we can't assume, since we want to know how big $n$ needs to be to get a good confidence interval. – Neil G May 23 '12 at 21:43
  • (+1) Absolutely! Still, I felt safe with n > 16641 ;-) I measure a max difference of 0.01 between the quantiles of the Gaussian and the empirical mean for $n \approx 250$. – gui11aume May 23 '12 at 22:05
2

The Bayesian way of doing this loses no information:

Your variable $X$ is categorically distributed with probability vector $\mathbf p$. The conjugate prior of the categorical distribution is the Dirichlet distribution, so let $\mathbf p$ be Dirichlet-distributed with shape parameter vector $\boldsymbol\phi$. With every observation, you update $\boldsymbol\phi$ by incrementing the component corresponding to the realized outcome. You can then check whether your maximum-likelihood estimate $\mathbf p^\star$ is within 0.01 of the true probability by integrating the posterior (Dirichlet) density over the ball of radius 0.01 centered at $\mathbf p^\star$.
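A minimal sketch of this scheme, estimating the integral by Monte Carlo. The flat prior $\boldsymbol\phi = (1,1,1)$, the simulated data, and the use of a componentwise (max-norm) ball, matching the question's phrasing, are illustrative assumptions, not prescribed by the answer:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.ones(3)                    # flat Dirichlet prior (assumed)

# Each observation of X increments the corresponding component of phi.
p_true = np.array([0.2, 0.3, 0.5])  # hypothetical; unknown in practice
for x in rng.choice(3, size=20000, p=p_true):
    phi[x] += 1

# Posterior mode; equals the maximum-likelihood estimate under the flat prior.
p_star = (phi - 1) / (phi - 1).sum()

# Posterior probability that p lies within 0.01 of p_star (componentwise),
# estimated by Monte Carlo integration of the Dirichlet posterior.
draws = rng.dirichlet(phi, size=100_000)
inside = np.all(np.abs(draws - p_star) <= 0.01, axis=1)
print(f"posterior P(max_i |p_i - p_star_i| <= 0.01) approx {inside.mean():.3f}")
```

Sampling stops once this posterior probability reaches the desired confidence level, e.g. 0.99.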

Neil G
  • Could you explain the last sentence in a bit more detail? What is the true probability here if only the data sample is known? Integrating over the ball, by Monte Carlo? – Vass Apr 09 '14 at 15:22
  • @Vass: The data induces a likelihood. So, the likelihood is also known. It has Dirichlet density, which you integrate to give the desired probability. – Neil G Apr 09 '14 at 20:20