2

I have a random variable $X$ that takes values in $\{0, 1, 2\}$, and I'd like to know whether I can compute a 95% confidence interval for the mean of $n$ samples from this distribution.

I know $P(X=0)$, $P(X=1)$ and $P(X=2)$. I know how to compute the true mean of the sample mean of $n$ samples, which equals $E[X] = 0 \cdot P(X=0) + 1 \cdot P(X=1) + 2 \cdot P(X=2)$, but I can't figure out how to compute the true confidence interval. It should not be that hard, but I am really stuck on what to use.
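Since the three probabilities are known, one possible reading of the "true" interval is the exact distribution of the sample mean itself. The sketch below is only an illustration under that assumption: it builds the distribution of $X_1 + \dots + X_n$ by repeated convolution and reads off an equal-tailed interval that holds at least 95% of the sample mean's probability mass. The function names and the probabilities `[0.2, 0.5, 0.3]` are hypothetical, and strictly speaking this is a probability interval for the sample mean, not a confidence interval built from data.

```python
def sum_pmf(p, n):
    """PMF of X_1 + ... + X_n for i.i.d. X with P(X = k) = p[k]."""
    pmf = [1.0]  # empty sum: all mass at 0
    for _ in range(n):
        new = [0.0] * (len(pmf) + len(p) - 1)
        for s, ps in enumerate(pmf):
            for k, pk in enumerate(p):
                new[s + k] += ps * pk  # convolve with one more copy of X
        pmf = new
    return pmf


def central_interval(p, n, level=0.95):
    """Equal-tailed interval holding at least `level` of the sample mean's mass."""
    pmf = sum_pmf(p, n)
    tail = (1.0 - level) / 2.0
    lo, cum = 0, 0.0
    for s, ps in enumerate(pmf):  # smallest s with P(S <= s) > tail
        if cum + ps > tail:
            lo = s
            break
        cum += ps
    hi, cum = len(pmf) - 1, 0.0
    for s in range(len(pmf) - 1, -1, -1):  # largest s with P(S >= s) > tail
        if cum + pmf[s] > tail:
            hi = s
            break
        cum += pmf[s]
    return lo / n, hi / n  # rescale from the sum to the mean


p = [0.2, 0.5, 0.3]  # hypothetical P(X=0), P(X=1), P(X=2)
n = 50
mu = sum(k * pk for k, pk in enumerate(p))  # true mean E[X]
lo, hi = central_interval(p, n)
print(mu, lo, hi)
```

Because the sample mean is discrete, the interval usually holds slightly more than 95% of the mass rather than exactly 95%.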

Hopefully it's not a stupid question!!

Thanks

Tereza Tizkova

2 Answers

2

The link I provided in the comments may be hard to apply to a general distribution given it presumes a parametric formulation.

I remembered a more straightforward approach that you may find useful. It is based on the ECDF of the distribution and relies on the DKW inequality, which allows one to form exact confidence bands around the ECDF ($F_n$, for a sample of size $n$) for the CDF ($F$):

$$P\left(\sup_{x\in \mathbb{R}} \left\vert F_n(x) - F(x)\right\vert > \varepsilon \right)\leq 2e^{-2n\varepsilon^2} \implies CI_{1-\alpha} = F_n(x) \pm \sqrt{\frac{\ln\left(\frac{2}{\alpha}\right)}{2n}}:=F_n(x)\pm \varepsilon_n$$
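To see where the half-width of the band comes from, set the right-hand side of the DKW bound equal to $\alpha$ and solve for $\varepsilon$ (the logarithm must be of $2/\alpha$ so that the quantity under the root is positive):

$$2e^{-2n\varepsilon^2}=\alpha \iff \varepsilon^2=\frac{\ln(2/\alpha)}{2n} \iff \varepsilon_n=\sqrt{\frac{\ln(2/\alpha)}{2n}}$$

As an illustration with values chosen here only for the example, $\alpha = 0.05$ and $n = 100$ give $\varepsilon_n = \sqrt{\ln(40)/200} \approx 0.136$.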

The CI for the mean is then simply the integral of the upper and lower tail distribution curves formed from the upper and lower bands.

Let's define the "shifted" tail curve as

$$T(\varepsilon):= \sum_{x=0}^{2n}\left(1-F_n(x)-\varepsilon\right)$$

Also, since $X\geq 0$, we have $E[X]=\int_0^{\infty} \left(1-F_X(x)\right) dx$, so we can form our confidence interval for the expected value from the confidence bands:

$$CI_{1-\alpha}\left(E[X]\right) = \left[T(\varepsilon_n),T(-\varepsilon_n)\right]$$

Where $$P\left(E[X] \in CI_{1-\alpha}\left(E[X]\right)\right) \geq 1-\alpha$$
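A minimal Python sketch of this construction for the question's case, $X \in \{0, 1, 2\}$, might look as follows. It rests on assumptions that are mine rather than part of the answer: for a single $X$ with this support the tail sum only needs $x = 0, 1$ (the tail is zero from $x = 2$ on), the shifted tail is clipped to $[0, 1]$ because it is a probability, and the half-width is taken as $\sqrt{\ln(2/\alpha)/(2n)}$. The name `dkw_mean_ci` and the simulated sample are hypothetical.

```python
import math
import random


def dkw_mean_ci(sample, support_max=2, alpha=0.05):
    """Conservative CI for E[X], X in {0, ..., support_max}, from the DKW band."""
    n = len(sample)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))  # DKW half-width
    lower = upper = 0.0
    for x in range(support_max):  # x = 0, ..., support_max - 1
        tail = sum(1 for v in sample if v > x) / n  # empirical 1 - F_n(x)
        # shift the tail by eps and clip to [0, 1] (assumption: not in the answer)
        lower += max(0.0, tail - eps)
        upper += min(1.0, tail + eps)
    return lower, upper


random.seed(0)
# hypothetical data: 500 draws with P(X=0), P(X=1), P(X=2) = 0.2, 0.5, 0.3
sample = random.choices([0, 1, 2], weights=[0.2, 0.5, 0.3], k=500)
lo, hi = dkw_mean_ci(sample)
print(lo, hi)
```

The interval is at most $2 \cdot 2 \cdot \varepsilon_n$ wide here, so it shrinks at rate $1/\sqrt{n}$, but with a larger constant than the usual normal-approximation interval.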

User5678
  • This is interesting (+1). Just a few questions: – Thomas Jan 27 '22 at 15:20
  • 1- why the constant C does not appear in the confidence interval for F(x) ? – Thomas Jan 27 '22 at 15:20
  • 2- the derived CI for the mean applies only to positive functions right ? ( the OP case by the way... ) – Thomas Jan 27 '22 at 15:22
  • 3- If you constrain your F(x) estimate by that band, with a probability alpha for every x, why the "errors" do not sum up when you evaluate the CI for the mean summing up the lower and upper band ? – Thomas Jan 27 '22 at 15:24
  • (these are really just questions for my understanding, by no means a criticism :) ) – Thomas Jan 27 '22 at 15:27
  • Very interesting! Isn't this CI going to be conservative rather than sharp? Do you think that would be a problem if I use it as a benchmark to judge which CI estimate is best when we don't have access to the distribution? I'm going to look more into this, but I was wondering about it when I first read your answer. – FluidMechanics Potential Flows Jan 27 '22 at 19:35
  • @Thomas -- there is no constant "C" -- it is just part of my notation: $CI_{\alpha}$ refers to a confidence interval – User5678 Jan 27 '22 at 23:27
  • @Thomas for 2 - yes, but you can generalize to any bounded distribution by translating it, forming the interval, then translating back. – User5678 Jan 27 '22 at 23:31
  • @Thomas -3- what I gave was a confidence band for the entire CDF, not a pointwise interval for each $x$ -- the coverage at each point will be higher than the nominal coverage. – User5678 Jan 27 '22 at 23:32
  • @FluidMechanicsPotentialFlows since it is a nonparametric method, its confidence level will generally be conservative. By the definition of confidence intervals, that is fine -- we only require that the infimum of the coverage of the interval across all parameter values is at least the stated nominal level. As pointed out by others, there is no one true CI. You can certainly compare it to approximate CIs, where the coverage probability can be lower than the nominal level. – User5678 Jan 27 '22 at 23:35
  • @Bey would you say it is a relatively sensible idea to compare the conservative confidence intervals generated with the DKW inequality to the classic approximate ones based on the Student t distribution / normal distribution, to show how far off they are as a function of n and of the true parameters? (i.e. maybe there are some parameters that make it harder for those approximate CIs to be "good" - I've seen that this was the case for the binomial distribution in a paper similar to the one you mentioned) – FluidMechanics Potential Flows Jan 28 '22 at 00:12
  • (by the way, I got confused because on Wikipedia they state the DKW inequality with a constant C, but then it was proven that C=2 works, which is what you used) – Thomas Jan 28 '22 at 01:18
1

One simple thing that one can always try, following Casella & Berger, is to build an approximate confidence interval. This has the advantage that it does not depend on assumptions about the distribution, but it is correct only for large sample sizes. I add it in case the OP is not familiar with the procedure.

From the CLT and Slutsky's theorem we always have an asymptotic pivotal statistic:

$$T= \frac{\overline{X}-\mu}{S/\sqrt{n}}$$

where $\overline{X}$ is the sample mean and $S$ is the sample standard deviation.

For large $n$, $T$ converges in distribution to $N(0,1)$. Therefore, writing $z_{\alpha}$ as usual for the value such that $\alpha=P(Z>z_{\alpha})$, an asymptotic confidence interval with confidence level $\alpha$ for the mean of the distribution is:

$$\overline{X}-z_{\frac{1-\alpha}{2}}\frac{S}{\sqrt{n}}<\mu<\overline{X}+z_{\frac{1-\alpha}{2}}\frac{S}{\sqrt{n}}$$

Of course, this works only for large $n$, but I think it also applies in your case. I am sure there are also better small-sample estimators.
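As a rough sketch of the procedure, the interval can be computed with the Python standard library alone. Here `level` plays the role of the confidence level $\alpha$ above (so the quantile used is $z_{(1-\alpha)/2}$, about 1.96 for 95%), and the function name `clt_mean_ci` and the simulated sample are hypothetical choices made for this illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def clt_mean_ci(sample, level=0.95):
    """Asymptotic (CLT-based) confidence interval for the mean."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    # upper-tail quantile z_{(1 - level) / 2} of the standard normal
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    half = z * s / math.sqrt(n)
    return xbar - half, xbar + half


random.seed(1)
# hypothetical data: 200 draws with P(X=0), P(X=1), P(X=2) = 0.2, 0.5, 0.3
sample = random.choices([0, 1, 2], weights=[0.2, 0.5, 0.3], k=200)
lo, hi = clt_mean_ci(sample)
print(lo, hi)
```

The coverage of this interval is only approximately equal to the stated level and can fall below it for small $n$.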

Thomas
  • Hello, that is very clear, and one application of knowing the "true" CI would, for example, be to compare this approximate CI with the true one as the sample size gets bigger. But that is a procedure for which I'd need to know the true CI. I'm currently reading https://ecommons.cornell.edu/bitstream/handle/1813/32943/BU-839-M.pdf;jsessionid=96C3EDEB9A505D8015D17A31B360B1A4?sequence=1 but it's a bit complicated for me (although I graduated in maths and have a masters in maths ... rip) – FluidMechanics Potential Flows Jan 26 '22 at 15:04
  • tl;dr I'm keeping this in mind, and once I have worked out the true CI I'll compare the approximate one with the true one out of curiosity – FluidMechanics Potential Flows Jan 26 '22 at 15:06
  • Yes, write an answer/let me know if you make such a comparison :) . One thing to notice is that I do not think there exists THE "true" CI. There can be many "correct" confidence intervals, i.e. random intervals that contain the true value with a certain probability $\alpha$. But of course it would be nice to compare at least ONE correct CI with the only "asymptotically correct" one given by the CLT. – Thomas Jan 26 '22 at 15:10
  • I think there is one unique correct CI if we enforce that the true value is at the center of the interval, because any bigger interval would have greater probability and any smaller one would have smaller probability (unless some values have 0 probability of happening, which I don't think is the case here; if it is, then there could absolutely be several CIs, in which case I guess the definition of the CI would have to be "smallest interval such that ..." - it might actually be the case that a formal definition of a CI contains "smallest interval such that") – FluidMechanics Potential Flows Jan 26 '22 at 15:24
  • I am not really sure of how much the CI is unique. E.g. I guess different Pivot statistics could lead to different C.I. . But I am not sure I am qualified enough to answer better, we should open a new question I think :) – Thomas Jan 26 '22 at 15:28
  • Here https://math.stackexchange.com/questions/4346285/asymmetric-confidence-intervals#comment9100322_4346285 a related question – Thomas Jan 26 '22 at 15:29
  • Oh yes, I think where i talk about true CI i talk about if we have the actual distribution of the mean so no sample size comes into play – FluidMechanics Potential Flows Jan 26 '22 at 15:32
  • EDIT : i actually think my last comment doesn't make sense. A CI necessarily needs a sample size to come into play. – FluidMechanics Potential Flows Jan 27 '22 at 14:36
  • Yes, I do not have the time to follow the conversation closely, but to me a random interval is defined by two statistics (therefore functions of the sample X=X1,...,Xn): [U(X),W(X)], and it is an alpha-level confidence interval if the parameter belongs to the random interval with probability alpha. So two CIs are different if they are different as functions of the sample. – Thomas Jan 27 '22 at 15:10
  • Just a question that crossed my mind: isn't an asymptotic α confidence interval a bit paradoxical because asymptotically, the CI gets smaller and smaller and converges to [true value ; true value] ? – FluidMechanics Potential Flows Jan 27 '22 at 18:34
  • Well of course every CI will be smaller as n gets large. But yes the asymptotic one will be more accurate for n large and smaller CI. But of course the hope is that it becomes accurate for reasonably large n, that are of interest for the problem analyzed. – Thomas Jan 27 '22 at 20:29
  • I just felt like yes it will be more accurate but n will be big enough that CI won't really mean anything anymore because it will be +/- 10^-3 or something. But I get your point. – FluidMechanics Potential Flows Jan 27 '22 at 21:13
  • Well, even in that case at least you know that the confidence interval is small, and you know its order of magnitude. Otherwise, where would you take that info from? But yes, I think we think more or less the same :) – Thomas Jan 27 '22 at 21:18