Is it taboo to take the standard deviation of a very small list of numbers?

Question

I read through a similar question: Is it meaningful to calculate standard deviation of two numbers? However, this was largely focusing on error bars.

If I have five numbers, let's say:

[22.8, 31, 38.5, 48.9, 38.2]

The standard deviation comes out to: 9.7

Maybe there is nothing wrong with this in theory. I can't speak to common practice for summary statistics like this on a five-observation data set, but I know that more sophisticated analyses such as regressions are not taken seriously unless n is reasonably large (maybe several hundred or so).

Application: the target audience is math-educated, though not necessarily stats wizards. We are not trying to tackle anything really cerebral; computing the standard deviation is just a way to show the spread concisely. However, even though we are humble, I still fear there could be backlash for using this, almost inviting criticism of our whole research. I just want to make sure we are not committing any taboos.

Question

Would this computation irk the statistically enlightened group? Why/why not?

score 1 · Accepted Answer · answered Mar 31 '21 at 06:40

An important reason for finding the sample standard deviation $S$ is that $S$ is an estimate of the population standard deviation $\sigma.$ For larger $n$ the estimate is more precise. Confidence intervals can help to give an idea of the precision.

Suppose you are sampling from a normal distribution with standard deviation $\sigma = 7.$ I will show you the method for finding a confidence interval for $\sigma$ based on a sample (because it's no secret), but for purposes of this answer the point is the length of the resulting confidence interval. (You can skip the 'Method' section if it gives more technical detail than you want.)

Method: One can show that $\frac{(n-1)S^2}{\sigma^2} \sim \mathsf{Chisq}(n-1),$ the chi-squared distribution with $n-1$ degrees of freedom. From that distribution I can find boundaries $L$ and $U$ (for example, its 0.025 and 0.975 quantiles) with $P\left(L < \frac{(n-1)S^2}{\sigma^2} < U\right) = 0.95.$ By manipulating inequalities, this becomes $P\left(\frac{(n-1)S^2}{U} < \sigma^2 < \frac{(n-1)S^2}{L}\right) = 0.95$ and then, taking square roots, $P\left(S\sqrt{\frac{n-1}{U}} < \sigma < S\sqrt{\frac{n-1}{L}}\right) = 0.95.$ Finally, we say that $\left(S\sqrt{\frac{n-1}{U}},\; S\sqrt{\frac{n-1}{L}}\right)$ is a 95% confidence interval for $\sigma.$ Roughly speaking, this means that for 95% of samples this interval includes $\sigma.$
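As a minimal sketch of the method above, the interval can be wrapped in a small R function. The name `sigma_ci` and its `level` argument are made up for illustration and are not part of base R, and the calculation assumes the data are a random sample from a normal population. It is applied here to the five values from the question.

```r
# Hypothetical helper (not part of base R): chi-squared confidence interval
# for sigma, assuming x is a random sample from a normal population.
sigma_ci <- function(x, level = 0.95) {
  n <- length(x)
  stopifnot(n >= 2, level > 0, level < 1)
  s <- sd(x)                          # sample SD (divisor n - 1)
  alpha <- 1 - level
  U <- qchisq(1 - alpha / 2, n - 1)   # upper chi-squared quantile
  L <- qchisq(alpha / 2, n - 1)       # lower chi-squared quantile
  c(lower = s * sqrt((n - 1) / U), upper = s * sqrt((n - 1) / L))
}

x <- c(22.8, 31, 38.5, 48.9, 38.2)    # the five values from the question
s <- sd(x)
ci <- sigma_ci(x)
round(s, 2)
round(ci, 1)
```

For these five values $S \approx 9.70$ and the interval runs from roughly 5.8 to 27.9. Under the normality assumption, the data are compatible with a population standard deviation anywhere in that fairly wide range.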

Computation: Suppose I have sample standard deviations $S$ for two samples from a normal population with mean $\mu = 50$ and standard deviation $\sigma = 7.$ The first sample is of size $n=10$ and the second is of size $n = 100.$ Below are simulated samples and their respective confidence intervals, using R statistical software.

Size 10: $S = 7.25.$ The 95% confidence interval is $(4.99, 13.24)$ of length $8.25.$

set.seed(330)
s = sd(rnorm(10, 50, 7));  s
[1] 7.252305
CI = sqrt(9*s^2/qchisq(c(.975,.025),9))
CI;  diff(CI)
[1]  4.988391 13.239882   # CI
[1] 8.251491              # length of CI

Size 100: $S = 7.20.$ The 95% confidence interval is $(6.32, 8.36)$ of length $2.04.$

set.seed(331)
s = sd(rnorm(100, 50, 7));  s
[1] 7.200381
CI = sqrt(99*s^2/qchisq(c(.975,.025),99))
CI;  diff(CI)
[1] 6.321984 8.364505
[1] 2.042521

And just for illustration, without showing the computations, with $n = 2,$ I got $S = 3.34,$ giving a confidence interval $(1.49, 106.60)$ of length $105.11.$ The confidence interval does include $\sigma = 7,$ but it is too long to be of any practical use.
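The effect of $n$ on interval length can also be seen without simulating data, because both endpoints are fixed multiples of $S$ for a given $n.$ A rough sketch in R, where `ci_factors` is a name invented here for illustration and normal data are assumed:

```r
# Hypothetical helper: multipliers of S that give the endpoints of the
# chi-squared confidence interval for sigma (normal data assumed).
ci_factors <- function(n, level = 0.95) {
  alpha <- 1 - level
  df <- n - 1
  lower <- sqrt(df / qchisq(1 - alpha / 2, df))   # lower endpoint = lower * S
  upper <- sqrt(df / qchisq(alpha / 2, df))       # upper endpoint = upper * S
  data.frame(n = n, lower = lower, upper = upper, ratio = upper / lower)
}

tab <- ci_factors(c(2, 5, 10, 100))
print(tab, digits = 3)
```

The `ratio` column is the upper endpoint divided by the lower one. It depends only on $n$ and the confidence level, not on the data; for $n = 2$ the upper endpoint is more than 70 times the lower one.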

I wouldn't say that your readers would be 'irked' if you give a sample standard deviation for a sample of size $n = 2,$ but they might wonder why you bothered to report it. (Maybe it's better just to give the two values.)

Finally, there is another difficulty with sample standard deviations based on samples of only two observations: they are seriously biased downward as estimates of $\sigma.$ On average, $S$ based on two normal observations will have the value $0.798\sigma.$ That is, $E(S) \approx 0.798\sigma.$ [It is always true, for all $n,$ that $E(S) < \sigma,$ but for $n$ as large as 10, the downward bias is seldom of practical importance, and the bias becomes truly negligible for large $n$.]
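For normal data the expected value has a closed form, $E(S) = c_4(n)\,\sigma$ with $c_4(n) = \sqrt{2/(n-1)}\;\Gamma(n/2)/\Gamma((n-1)/2).$ The sketch below evaluates that constant and checks it against a small simulation. The function name `c4` follows quality-control usage and is not a base R function, and the seed and number of replications are arbitrary choices.

```r
# Bias factor for S under normality: E(S) = c4(n) * sigma.
# c4 is defined here; it is not a base R function.
c4 <- function(n) sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))

bias <- c4(c(2, 5, 10, 100))
round(bias, 3)

# Simulation check for n = 2 with sigma = 7 (seed and replication count arbitrary)
set.seed(2021)
s_sim <- replicate(1e5, sd(rnorm(2, 50, 7)))
mean(s_sim) / 7     # should be close to c4(2)
```

The first value, `c4(2)`, is about 0.798, matching the figure quoted above.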

Thanks, so for n=5, the bias is still a potential concern, if I understand your explanation correctly. — Arash Howaida, Mar 31 '21 at 07:12
Depends on how fussy you are: for $n=5,$ it's $E(S) \approx 0.94\sigma.$ Also $E(S^2) = \sigma^2.$ But expectation is a linear operator and the square root is not linear, so the equality doesn't survive taking square roots. — BruceET, Mar 31 '21 at 07:18
More [here](https://stats.stackexchange.com/questions/11707/why-is-sample-standard-deviation-a-biased-estimator-of-sigma). — BruceET, Mar 31 '21 at 07:27
