
Can you help me decide on the minimal sample size for a uniformly distributed sample?

Assume that I've found the sample average, the standard deviation, and $\alpha$.

nba
  • The minimum sample size *for what purpose*? – Elvis Dec 22 '11 at 09:58
  • This question appears so closely related to the one at http://stats.stackexchange.com/q/3121 that perhaps the answers and comments there will be helpful here, too. – whuber Dec 22 '11 at 14:11

2 Answers


After the discussion in the comments, I have rewritten most of this answer...

It seems that the question can be interpreted as "how to find a confidence interval for the mean".

1. Without any assumption on the distribution

You sample independent $X_1, \dots, X_n$ from an unknown distribution with finite mean $\mu$ and finite variance $\sigma^2$. By the central limit theorem, for $n$ large enough, you can approximate the distribution of the sample mean $\overline X = {1\over n} \sum_{i=1}^n X_i$ by a $\mathcal N \left( \mu, {1\over n}\sigma^2\right)$.

Approximating $\sigma^2$ by $\widehat{\sigma^2} = {n \over n-1} \left( {1\over n} \sum_i X_i^2 - \left(\overline X\right)^2\right)$, an asymptotic confidence interval of level $1-\alpha$ on $\mu$ is given by $$ \left[ \overline X - z_{1-\alpha/2} \sqrt{\widehat{\sigma^2} \over n} ; \overline X + z_{1-\alpha/2} \sqrt{\widehat{\sigma^2} \over n} \right],$$ where $z_{1-\alpha/2}$ is a quantile of the standard normal distribution (e.g., for $\alpha = 0.05$, $z_{0.975} \approx 1.96$).

This can be used to get a rough estimate of the number of additional samples you may need to collect: plug the current estimate $\widehat{\sigma^2}$ into the half-width $z_{1-\alpha/2} \sqrt{\widehat{\sigma^2} / n}$ and solve for the $n$ that brings it below a desired half-width $E$, i.e. $n \ge z_{1-\alpha/2}^2 \, \widehat{\sigma^2} / E^2$.
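For example, here is a minimal R sketch of this calculation (the pilot sample and the target half-width E are hypothetical):

z <- qnorm(0.975)            # alpha = 0.05
x <- runif(30)               # hypothetical pilot sample
E <- 0.02                    # desired CI half-width (hypothetical)
ceiling(z^2 * var(x) / E^2)  # rough total sample size needed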

2. Assuming that the distribution is uniform

Note that this assumption shouldn’t be made without serious reasons.

So you sample independent $X_1, \dots, X_n$ ($n > 1$) from a uniform distribution $\mathcal U (a,b)$, the bounds of the interval $[a,b]$ being unknown parameters. The expectation is ${1\over 2}(a+b)$.

The maximum likelihood estimators of $a$ and $b$ are $m = \min_i X_i$ and $M = \max_i X_i$. These are not independent, so we consider the joint density of $(m,M)$, which is given by $$\phi(u,v) = \left\{ \begin{array}{ll} {n(n-1) \over (b-a)^n} (v-u)^{n-2} & \mbox{if}\ a \le u \le v \le b \\ 0 & \mbox{else} \end{array}\right.$$

From this it is easy to get the density of ${1\over 2}(m+M)$, which is very concentrated around ${1\over 2}(a+b)$:

$$f(t) = \left\{ \begin{array}{ll} {n 2^{n-1} \over (b-a)^n} (t-a)^{n-1} & a\le t \le {1\over2}(a+b) \\ {n 2^{n-1} \over (b-a)^n} (b-t)^{n-1} & {1\over2}(a+b)\le t \le b \\ \end{array}\right.$$
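As a quick sanity check, here is a minimal R sketch (my own, assuming $a = 0$ and $b = 1$) comparing this density to a histogram of simulated midranges:

n <- 20
mid <- replicate(20000, { x <- runif(n); (min(x) + max(x))/2 })
hist(mid, breaks = 60, freq = FALSE)                 # empirical distribution of (m+M)/2
f <- function(t) n * 2^(n-1) * pmin(t, 1 - t)^(n-1)  # the density above with a = 0, b = 1
curve(f, from = 0, to = 1, add = TRUE)               # theoretical curve over the histogram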

To get a confidence interval of the form $$\left[ {1\over2}(m+M) - {\gamma\over 2}(M-m) ;{1\over2}(m+M) + {\gamma\over 2}(M-m) \right],$$ we compute $$\mathbb P \left( {1\over2}(m+M) - {\gamma\over 2}(M-m) \le {1\over2}(a+b) \le {1\over2}(m+M) + {\gamma\over 2}(M-m) \right)$$ which is simply equal to $1 - {1\over (1+\gamma)^{n-1}}$, so to get a CI of level $1-\alpha$ you just set $\gamma = e^{-{1\over n-1}\log \alpha} - 1 = \alpha^{-1/(n-1)} - 1$. This procedure gives a surprisingly small CI (or more precisely, its width decreases surprisingly fast as $n$ increases; for large $n$, $\gamma \sim -{1\over n}\log \alpha$).
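Since this coverage is exact rather than asymptotic, it is easy to check by simulation; a minimal sketch (variable names are mine):

set.seed(1)
n <- 50; alpha <- 0.05
gamma <- alpha^(-1/(n - 1)) - 1       # same as exp(-log(alpha)/(n - 1)) - 1
inside <- replicate(10000, {
  x <- runif(n)                       # true mean is 0.5
  m <- min(x); M <- max(x)
  abs(0.5*(m + M) - 0.5) <= 0.5*gamma*(M - m)
})
mean(inside)                          # should be close to 1 - alpha = 0.95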

3. Short illustration with R and $n = 50$, $n=1000$

It is plain that the CI obtained by the second method is much shorter. Just an illustration of this:

> n <- 50
> x <- runif(n)
> gamma <- exp(-log(0.05)/(n-1)) - 1
> m <- min(x); M <- max(x)
>  0.5*( (m+M) + c(-1,1)*gamma*(M-m) )
[1] 0.4799359 0.5404375
> mean(x) + c(-1,1)*1.96*sd(x)/sqrt(n)
[1] 0.4290694 0.5892294

And with $n = 1000$:

> n <- 1000
...
>  0.5*( (m+M) + c(-1,1)*gamma*(M-m) )
[1] 0.4984559 0.5014571
> mean(x) + c(-1,1)*1.96*sd(x)/sqrt(n)
[1] 0.4805048 0.5161008

4. Some elements from the original answer, to help understand the discussion in the comments

In the original answer, I foolhardily proposed to use the CI procedure based on the central limit theorem, but estimating the variance, which is ${1 \over 12} (b-a)^2$, by the quantity ${1\over 12} (M-m)^2$. This was a curious mixture; I thank everybody for this stimulating discussion.
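A minimal sketch (my own, along the lines of jbowman's experiment in the comments below) comparing the usual estimator of $\sigma$ with the range-based one:

sims <- replicate(10000, {
  x <- runif(100)                       # true sigma = 1/sqrt(12) ~ 0.2887
  c(sd(x), (max(x) - min(x))/sqrt(12))  # usual estimator vs. range-based one
})
apply(sims, 1, sd)   # standard errors: the range-based estimator's is much smaller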

Elvis
  • This estimator of the standard deviation is unusual. Is it superior (in any way) to the usual estimator based on the sample SD? – whuber Dec 22 '11 at 15:40
  • @whuber Hum, I was focused on the keyword "uniform distribution"... I don’t have time now (and won’t have time in the next few days) to dig into this, but quickly: this estimator is biased, which is bad, but it could be debiased; the great quality of this estimator is that its standard error is proportional to $1/n$, whereas the classical estimator has a standard error proportional to $1/\sqrt n$! – Elvis Dec 22 '11 at 18:15
  • @whuber If you want to see the full computation, I could do that – I never did it, but I am confident it is tractable, and curious myself to see it. However I’ll have to ask you to wait until 2012, I’ll be off... Remind me later, one way or another. – Elvis Dec 22 '11 at 18:18
  • The $1/n$ behavior derives from the assumption of uniformity. I suspect similarly good asymptotic behavior may attach to the usual estimator when this assumption is made. (Bias is not the dominant issue here, because it's a $O(1/n)$ error.) I have also been wondering whether a linear combination of order statistics might do better than the usual SD. It certainly cannot do any worse than your proposal, which is already in that form, and seems likely to do much better--but I haven't the time to do the calculations. – whuber Dec 22 '11 at 18:20
  • I generated 100,000 samples of size 100 from a $U(0,1)$ distribution. The sample standard deviation had a standard deviation of 0.013; the MLE estimate had a standard deviation of 0.004. The RMSE of the sample standard deviation was 0.013, the RMSE of the MLE estimate was 0.007 (up from the sd of 0.004 due to the bias.) Hmmm, should I run into this problem, I'll estimate the standard deviation differently now! – jbowman Dec 22 '11 at 18:27
  • I generated 50,000 samples of sizes from 10 to ~ 1000 in steps of 50, and plotted (size, sd) to verify my assertions! The standard error of the usual estimator for σ really is in $1/\sqrt n$. The uniform distribution is the classical example where the MLEs for the bounds converge exceptionally fast. I didn’t have this in mind when I used this estimator, I just wanted to make plain use of the information that the distribution is uniform. I don’t have time either, but this is a nice problem (whuber’s last proposition is very interesting). – Elvis Dec 22 '11 at 19:55
  • @whuber I think min Xi and max Xi are sufficient statistics for the uniform distribution. Wouldn’t that mean that using other order statistics can’t improve things? – Elvis Dec 22 '11 at 19:57
  • Good point--I had overlooked that well-known fact. I think what's nagging at me, though, is how unusual it is to *know* that the parent distribution truly is uniform. I would question that assumption in this particular case (other questioners on SE sites have [mistakenly used "uniform" to refer to decidedly non-uniform distributions](http://stats.stackexchange.com/q/7542)). In general, one should favor methods that are robust to failures of distributional assumptions. The main worry about using the extremes of the sample is that they are as non-robust as one can get. – whuber Dec 22 '11 at 21:37
  • @whuber I agree... – Elvis Dec 23 '11 at 07:34
  • Ignoring for a moment the important caveats that @whuber mentions, it is curious that you use $\bar{X}$ for the mean estimate but a variance estimate using the order statistics. Using the order statistics as an estimate of the mean will yield convergence with variance $O(n^{-2})$ instead of the $O(n^{-1})$ from using the sample mean. Note also that $\hat{\sigma}_n^2 \uparrow \sigma^2 = (b-a)^2/12$, providing an easy proof of (a) almost sure convergence and (b) negative, but negligible, bias of the variance estimate given. It is straightforward (though perhaps inadvisable) to correct the bias. – cardinal Dec 23 '11 at 14:59
  • @cardinal you are totally right, I had the same thoughts. :) – Elvis Dec 23 '11 at 16:08
  • @cardinal I hope you’ll like it this way! – Elvis Dec 27 '11 at 21:48
  • @whuber I think you might like to read the improved answer. This answer is far from being complete (I skipped most gory details, I didn’t give the variances, or the CI you can get on the parameters a and b), but I think that for this question it’s enough. However if some of you want to see more of it... I could write more. – Elvis Dec 27 '11 at 22:25

The minimum sample size for estimating the sample average is 1. The minimum sample size for estimating the standard deviation is 2. I don't know about the minimum sample size for estimating $\alpha$, as I don't know what you intend $\alpha$ to signify.

onestop
  • I want to find the connection between the sample size and the significance of the sample average. (Then I’ll be able to decide what sample size is sufficient.) – nba Dec 22 '11 at 11:25
  • @nba What do you mean by *signification of the sample average*? Your question is far from being clear. First of all, is that a continuous or a discrete uniform distribution? On which set? Is it for example that you are sampling a uniform distribution on $[0,\theta]$, and you want to estimate $\theta$? Tell us what you want to do. – Elvis Dec 22 '11 at 13:14
  • Technically, the minimum sample size for estimating any parameter is *zero* ;-) – whuber Dec 22 '11 at 13:26
  • I was measuring data for a system. The number of measurements was N, and I want to know if I need to measure more. The purpose is to achieve a good estimate of the mean (by a calculation on the data, a continuous uniform distribution fits it best). – nba Dec 22 '11 at 13:29
  • @nba "Good" is relative, which is why you are receiving such cryptic responses. The uncertainty in your estimate should shrink as sample size increases, and if you can quantify "good estimation" a bit then folks here can help you more. – Michael McGowan Dec 22 '11 at 14:36