22

It came as a bit of a shock to me the first time I ran a normal-distribution Monte Carlo simulation and discovered that the mean of $100$ standard deviations from $100$ samples, each of size only $n=2$, was much less than the $\sigma$ used to generate the population; on average it was $\sqrt{\frac{2}{\pi }}$ times $\sigma$. This is well known, if seldom remembered, and I sort of did know it, or I would not have run a simulation in the first place.

Here is an example: predicting the 95% confidence interval of $N(0,1)$ from 100 estimates of $\text{SD}$, each with $n=2$, alongside the corrected estimator $\text{E}(s_{n=2})=\sqrt{\frac{\pi}{2}}\,\text{SD}$.

 RAND()   RAND()    Calc    Calc    
 N(0,1)   N(0,1)    SD      E(s)    
-1.1171  -0.0627    0.7455  0.9344  
 1.7278  -0.8016    1.7886  2.2417  
 1.3705  -1.3710    1.9385  2.4295  
 1.5648  -0.7156    1.6125  2.0209  
 1.2379   0.4896    0.5291  0.6632  
-1.8354   1.0531    2.0425  2.5599  
 1.0320  -0.3531    0.9794  1.2275  
 1.2021  -0.3631    1.1067  1.3871  
 1.3201  -1.1058    1.7154  2.1499  
-0.4946  -1.1428    0.4583  0.5744  
 0.9504  -1.0300    1.4003  1.7551  
-1.6001   0.5811    1.5423  1.9330  
-0.5153   0.8008    0.9306  1.1663  
-0.7106  -0.5577    0.1081  0.1354  
 0.1864   0.2581    0.0507  0.0635  
-0.8702  -0.1520    0.5078  0.6365  
-0.3862   0.4528    0.5933  0.7436  
-0.8531   0.1371    0.7002  0.8775  
-0.8786   0.2086    0.7687  0.9635  
 0.6431   0.7323    0.0631  0.0791  
 1.0368   0.3354    0.4959  0.6216  
-1.0619  -1.2663    0.1445  0.1811  
 0.0600  -0.2569    0.2241  0.2808  
-0.6840  -0.4787    0.1452  0.1820  
 0.2507   0.6593    0.2889  0.3620  
 0.1328  -0.1339    0.1886  0.2364  
-0.2118  -0.0100    0.1427  0.1788  
-0.7496  -1.1437    0.2786  0.3492  
 0.9017   0.0022    0.6361  0.7972  
 0.5560   0.8943    0.2393  0.2999  
-0.1483  -1.1324    0.6959  0.8721  
-1.3194  -0.3915    0.6562  0.8224  
-0.8098  -2.0478    0.8754  1.0971  
-0.3052  -1.1937    0.6282  0.7873  
 0.5170  -0.6323    0.8127  1.0186  
 0.6333  -1.3720    1.4180  1.7772  
-1.5503   0.7194    1.6049  2.0115  
 1.8986  -0.7427    1.8677  2.3408  
 2.3656  -0.3820    1.9428  2.4350  
-1.4987   0.4368    1.3686  1.7153  
-0.5064   1.3950    1.3444  1.6850  
 1.2508   0.6081    0.4545  0.5696  
-0.1696  -0.5459    0.2661  0.3335  
-0.3834  -0.8872    0.3562  0.4465  
 0.0300  -0.8531    0.6244  0.7826  
 0.4210   0.3356    0.0604  0.0757  
 0.0165   2.0690    1.4514  1.8190  
-0.2689   1.5595    1.2929  1.6204  
 1.3385   0.5087    0.5868  0.7354  
 1.1067   0.3987    0.5006  0.6275  
 2.0015  -0.6360    1.8650  2.3374  
-0.4504   0.6166    0.7545  0.9456  
 0.3197  -0.6227    0.6664  0.8352  
-1.2794  -0.9927    0.2027  0.2541  
 1.6603  -0.0543    1.2124  1.5195  
 0.9649  -1.2625    1.5750  1.9739  
-0.3380  -0.2459    0.0652  0.0817  
-0.8612   2.1456    2.1261  2.6647  
 0.4976  -1.0538    1.0970  1.3749  
-0.2007  -1.3870    0.8388  1.0513  
-0.9597   0.6327    1.1260  1.4112  
-2.6118  -0.1505    1.7404  2.1813  
 0.7155  -0.1909    0.6409  0.8033  
 0.0548  -0.2159    0.1914  0.2399  
-0.2775   0.4864    0.5402  0.6770  
-1.2364  -0.0736    0.8222  1.0305  
-0.8868  -0.6960    0.1349  0.1691  
 1.2804  -0.2276    1.0664  1.3365  
 0.5560  -0.9552    1.0686  1.3393  
 0.4643  -0.6173    0.7648  0.9585  
 0.4884  -0.6474    0.8031  1.0066  
 1.3860   0.5479    0.5926  0.7427  
-0.9313   0.5375    1.0386  1.3018  
-0.3466  -0.3809    0.0243  0.0304  
 0.7211  -0.1546    0.6192  0.7760  
-1.4551  -0.1350    0.9334  1.1699  
 0.0673   0.4291    0.2559  0.3207  
 0.3190  -0.1510    0.3323  0.4165  
-1.6514  -0.3824    0.8973  1.1246  
-1.0128  -1.5745    0.3972  0.4978  
-1.2337  -0.7164    0.3658  0.4585  
-1.7677  -1.9776    0.1484  0.1860  
-0.9519  -0.1155    0.5914  0.7412  
 1.1165  -0.6071    1.2188  1.5275  
-1.7772   0.7592    1.7935  2.2478  
 0.1343  -0.0458    0.1273  0.1596  
 0.2270   0.9698    0.5253  0.6583  
-0.1697  -0.5589    0.2752  0.3450  
 2.1011   0.2483    1.3101  1.6420  
-0.0374   0.2988    0.2377  0.2980  
-0.4209   0.5742    0.7037  0.8819  
 1.6728  -0.2046    1.3275  1.6638  
 1.4985  -1.6225    2.2069  2.7659  
 0.5342  -0.5074    0.7365  0.9231  
 0.7119   0.8128    0.0713  0.0894  
 1.0165  -1.2300    1.5885  1.9909  
-0.2646  -0.5301    0.1878  0.2353  
-1.1488  -0.2888    0.6081  0.7621  
-0.4225   0.8703    0.9141  1.1457  
 0.7990  -1.1515    1.3792  1.7286  

 0.0344  -0.1892    0.8188  1.0263  column means
                    SD pred E(s) pred   
-1.9600  -1.9600   -1.6049 -2.0114    2.5%  theor, est
 1.9600   1.9600    1.6049  2.0114   97.5%  theor, est
                    0.3551 -0.0515    2.5% err
                   -0.3551  0.0515   97.5% err

The grand totals are in the rows at the bottom of the table. I used the ordinary SD estimator to calculate 95% confidence intervals around a mean of zero, and they are off by 0.3551 standard deviation units. The E(s) estimator is off by only 0.0515 standard deviation units. So whenever one estimates a standard deviation, a standard error of the mean, or a t-statistic from small samples, there may be a problem.
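For anyone who wants to reproduce the effect without a spreadsheet, here is a minimal Octave/Matlab sketch (my own code, in the style of the code further down the page, not the spreadsheet used for the table above):

% repeat the n=2 experiment: N samples of size 2 from N(0,1)
N=100; n=2;
x=randn(n,N);              % standard-normal draws
sd=std(x);                 % usual SD estimator (n-1 divisor)
es=sqrt(pi/2)*sd;          % small-sample corrected E(s) for n=2
mean(sd)                   % ~0.80 = sqrt(2/pi): biased low
mean(es)                   % ~1.00: approximately unbiased

With $N=100$ the averages bounce around somewhat; raising N to 1e6 shows them settling at $\sqrt{2/\pi}\approx0.798$ and $1$.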

My reasoning was as follows: the population mean $\mu$ of two values can lie anywhere relative to $x_1$ and is almost surely not located at $\frac{x_1+x_2}{2}$; that midpoint yields the absolute minimum possible sum of squares, so we underestimate $\sigma$ substantially, as follows.

W.l.o.g. let $x_2-x_1=d$; then $\Sigma_{i=1}^{2}(x_i-\bar{x})^2 = 2\left(\frac{d}{2}\right)^2=\frac{d^2}{2}$, the least possible value.

That means that standard deviation calculated as

$\text{SD}=\sqrt{\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$ ,

is a biased estimator of the population standard deviation ($\sigma$). Note that the formula already applies a correction: we decrement the degrees of freedom by one and divide by $n-1$ rather than $n$, but that correction is only asymptotically right, and $n-3/2$ would be a better rule of thumb. For our $x_2-x_1=d$ example, the $\text{SD}$ formula gives $SD=\frac{d}{\sqrt 2}\approx 0.707d$, a statistically implausible minimum value since $\mu\neq \bar{x}$; a better expected value ($s$) would be $E(s)=\sqrt{\frac{\pi }{2}}\frac{d}{\sqrt 2}=\frac{\sqrt\pi }{2}d\approx0.886d$. In the usual calculation, $\text{SD}$s for $n<10$ suffer from very significant underestimation, called small-number bias, and the underestimation of $\sigma$ only drops to about 1% when $n$ is approximately $25$. Since many biological experiments have $n<25$, this is indeed an issue. For $n=1000$, the error is approximately 25 parts in 100,000. In general, small-number bias correction implies that the unbiased estimator of the population standard deviation of a normal distribution is

$\text{E}(s)\,=\,\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}\sqrt{\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{2}}>\text{SD}=\sqrt{\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\; .$
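As a sketch of how one might evaluate this in practice (Octave/Matlab; gammaln is used rather than gamma so the ratio does not overflow at larger $n$):

% unbiased estimate of sigma for a Gaussian sample x (any n > 1)
x=randn(7,1);              % example sample
n=numel(x);
c4=sqrt(2/(n-1))*exp(gammaln(n/2)-gammaln((n-1)/2));
es=std(x)/c4;              % E[es] = sigma for i.i.d. normal x
[1/c4 sqrt((n-1)/(n-1.5))] % the n-3/2 rule of thumb tracks 1/c4 closely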

From Wikipedia (Rb88guy, CC BY-SA 3.0 or GFDL) there is a plot of the underestimation of $\sigma$ by SD (the $c_4$ factor) as a function of $n$: https://commons.wikimedia.org/wiki/File%3AStddevc4factor.jpg

Since SD is a biased estimator of the population standard deviation, it cannot be the minimum-variance unbiased estimator (MVUE) of the population standard deviation, unless we are happy to say that it is MVUE as $n\rightarrow \infty$, which I, for one, am not.

Concerning non-normal distributions and approximately unbiased $SD$ read this.

Now comes the question Q1

Can it be proven that the $\text{E}(s)$ above is the MVUE for $\sigma$ of a normal distribution, from a sample of size $n$, where $n$ is an integer greater than one?

Hint: (But not the answer) see How can I find the standard deviation of the sample standard deviation from a normal distribution?.

Next question, Q2

Would someone please explain why we use $\text{SD}$ at all, when it is clearly biased and misleading? That is, why not use $\text{E}(s)$ for nearly everything? Supplementary: it has become clear in the answers below that the variance is unbiased, but its square root is biased. I would request that answers address the question of when the unbiased standard deviation should be used.

As it turns out, a partial answer is that the bias in the simulation above could have been avoided by averaging the variances rather than the SD values. To see the effect, square the SD column above and average those values: the result is 0.9994, whose square root, 0.9996915, is an estimate of the standard deviation with an error of only 0.0006 for the 2.5% tail and -0.0006 for the 97.5% tail. This works because variances are additive, so averaging them is a low-error procedure. Standard deviations, however, are biased, and in those cases where we do not have the luxury of using variances as an intermediary, we still need small-number correction. Even when variance can serve as the intermediary, as here with $n=100$ estimates, the small-sample correction suggests multiplying the square root of the unbiased variance, 0.9996915, by 1.002528401 to give 1.002219148 as an unbiased estimate of the standard deviation. So yes, we can delay applying the small-number correction, but should we therefore ignore it entirely?
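The order of operations is easy to check numerically; a short sketch (Octave/Matlab, my variable names):

% averaging SDs versus averaging variances, N samples of size n=2
N=1e5; n=2;
sd=std(randn(n,N));
mean(sd)                   % ~0.80: averaging SDs is biased low
sqrt(mean(sd.^2))          % ~1.00: averaging variances first is nearly unbiased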

The question here is when we should be using small-number correction, as opposed to ignoring it; predominantly, its use has been avoided.

Here is another example. The minimum number of points in space needed to establish a linear trend that has an error is three. If we fit such points with ordinary least squares, the residuals from many such fits follow a folded normal pattern if there is non-linearity and a half-normal pattern if there is linearity. In the half-normal case our distribution mean requires small-number correction. If we try the same trick with four or more points, the distribution will not generally be normal-related or easy to characterize. Can we use variance to somehow combine those 3-point results? Perhaps, perhaps not. However, it is easier to conceive of such problems in terms of distances and vectors.

Carl
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/49784/discussion-on-question-by-carl-why-we-are-using-a-biased-and-misleading-standard). – whuber Dec 08 '16 at 14:16
  • Q1: See the Lehmann-Scheffe theorem. – Scortchi - Reinstate Monica Dec 08 '16 at 15:57
  • @Scortchi Helpful (+1), that is what I was looking for for Q1. If you would be so kind as to run through it as an answer, I can edit to put in Q2 to later award the bounty. – Carl Dec 08 '16 at 23:14
  • Nonzero bias of an estimator is not necessarily a drawback. For example, if we wish to have an accurate estimator under square loss, we are willing to induce bias as long as it reduces the variance by a sufficiently large amount. That is why (biased) regularized estimators may perform better than the (unbiased) OLS estimator in a linear regression model, for example. – Richard Hardy Dec 14 '16 at 20:20
  • @RichardHardy Would prefer to say: (1) Bias is a reduced *accuracy* price one may be willing to pay for increased *precision.* (2) Alternatively, (e.g., for regularization,) bias of one thing (e.g., of fit) can be used to increase accuracy and precision of another (regression parameter target). Question here is when to use what. – Carl Dec 14 '16 at 20:45
  • @Carl, good point, I did not use the proper term there (accuracy vs. precision). I wonder what adjective combine the two, meaning both accurate and precise? Also, how can one define bias of fit? – Richard Hardy Dec 14 '16 at 20:54
  • @RichardHardy Bias of fit can be both theoretically and for application to data defined and quantified as the latent structure or tendency of residuals over their range. – Carl Dec 14 '16 at 21:17
  • @Carl, is it a common use of the word *bias*? Let us not assign another meaning to an already heavily loaded term. Perhaps another term could suit better? – Richard Hardy Dec 14 '16 at 21:23
  • @RichardHardy Best estimator is one candidate for the combination of accuracy and precision, but exactly what that means in general is vague. Certainly, in specific cases, when Method A is both more accurate and precise than Method B, we would have no problem say which is a better estimator. Moreover, either without the other is useless, that is accuracy without precision is just as worthless as precision without accuracy. – Carl Dec 14 '16 at 21:25
  • @RichardHardy 1) There appears to be a problem with how the term bias is used, and 2) some things that are biased are not so recognized, whereas some things that are thought unbiased actually are biased. I also suspect there may be a terminology problem. Perhaps the term 'bias' is being used in different contexts in different fields, e.g., stats versus physics versus math? – Carl Dec 14 '16 at 21:33
  • @Carl *many* terms are used differently in different application areas. If you're posting to a stats group and you use a jargon term like "bias", you would naturally be assumed to be using the specific meaning(s) of the term particular to statistics. If you mean *anything* else, it's essential to either use a different term or to define clearly what you do mean by the term right at the first use. – Glen_b Dec 15 '16 at 03:35
  • @Glen_b Oh and I do, I never say 'biased' without saying what, how and why. I do not understand bias to be jargon, the meaning either translates exactly into physics and math or it is incorrect. – Carl Dec 15 '16 at 03:45
  • "bias" is certainly a term of jargon -- *special words or expressions used by a profession or group that are difficult for others to understand* seems pretty much what "bias" is. It's because such terms have precise, specialized definitions in their application areas (including mathematical definitions) that makes them jargon terms. – Glen_b Dec 15 '16 at 03:50
  • @Glen_b My point is that 'bias' is jargon like 'cosine' or 'maximum likelihood' is jargon. – Carl Dec 15 '16 at 04:03
  • @Carl, agree with Glen_b. Honestly, for me discussions with you are often more difficult than usual precisely because you are using well-established statistical terms to denote something quite different than they denote in statistics (which may probably make sense in physics). So purely for convenience, I would recommend sticking to the classical statistical definitions and terms when discussing with statisticians (e.g. here at Cross Validated). – Richard Hardy Dec 15 '16 at 06:51
  • @RichardHardy Trust me, if I knew how to do that, I would. I actually am closest to a tag badge in `terminology` and I am working on perfecting my use of statistical language, and in fact, if reputation per post is any indication, I get more points for that than for anything. – Carl Dec 15 '16 at 07:38
  • @Carl, Glad to hear. You must be on the right track. – Richard Hardy Dec 15 '16 at 08:09

5 Answers

34

For the more restricted question

Why is a biased standard deviation formula typically used?

the simple answer

Because the associated variance estimator is unbiased. There is no real mathematical/statistical justification.

may be accurate in many cases.

However, this is not necessarily always the case. There are at least two important aspects of these issues that should be understood.

First, the sample variance $s^2$ is not just unbiased for Gaussian random variables. It is unbiased for any distribution with finite variance $\sigma^2$ (as discussed below, in my original answer). The question notes that $s$ is not unbiased for $\sigma$, and suggests an alternative which is unbiased for a Gaussian random variable. However it is important to note that unlike the variance, for the standard deviation it is not possible to have a "distribution free" unbiased estimator (*see note below).

Second, as mentioned in the comment by whuber the fact that $s$ is biased does not impact the standard "t test". First note that, for a Gaussian variable $x$, if we estimate z-scores from a sample $\{x_i\}$ as $$z_i=\frac{x_i-\mu}{\sigma}\approx\frac{x_i-\bar{x}}{s}$$ then these will be biased.

However the t statistic is usually used in the context of the sampling distribution of $\bar{x}$. In this case the z-score would be $$z_{\bar{x}}=\frac{\bar{x}-\mu}{\sigma_{\bar{x}}}\approx\frac{\bar{x}-\mu}{s/\sqrt{n}}=t$$ though we can compute neither $z_{\bar{x}}$ nor $t$ without a value for $\mu$ (and $z_{\bar{x}}$ additionally requires the unknown $\sigma$). Nonetheless, if $z_{\bar{x}}$ is normally distributed, then the $t$ statistic follows a Student-t distribution exactly. This is not a large-$n$ approximation. The only assumption is that the $x$ samples are i.i.d. Gaussian.

(Commonly the t-test is applied more broadly for possibly non-Gaussian $x$. This does rely on large-$n$, which by the central limit theorem ensures that $\bar{x}$ will still be Gaussian.)
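A small Monte Carlo sketch of this point (my code; tcdf is assumed available from the Statistics Toolbox or the Octave statistics package): no bias correction is applied to $s$, yet the empirical distribution of $t$ matches Student-t with $n-1$ degrees of freedom.

% t statistics from N Gaussian samples of size n, no bias correction
N=1e5; n=5; mu=0;
x=randn(n,N);
t=(mean(x)-mu)./(std(x)/sqrt(n));
q=[-2 -1 0 1 2];
[mean(t(:)<q); tcdf(q,n-1)]  % empirical vs theoretical CDF: rows agree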


*Clarification on "distribution-free unbiased estimator"

By "distribution free", I mean that the estimator cannot depend on any information about the population $x$ aside from the sample $\{x_1,\ldots,x_n\}$. By "unbiased" I mean that the expected error $\mathbb{E}[\hat{\theta}_n]-\theta$ is uniformly zero, independent of the sample size $n$. (As opposed to an estimator that is merely asymptotically unbiased, a.k.a. "consistent", for which the bias vanishes as $n\to\infty$.)

In the comments this was given as a possible example of a "distribution-free unbiased estimator". Abstracting a bit, this estimator is of the form $\hat{\sigma}=f[s,n,\kappa_x]$, where $\kappa_x$ is the excess kurtosis of $x$. This estimator is not "distribution free", as $\kappa_x$ depends on the distribution of $x$. The estimator is said to satisfy $\mathbb{E}[\hat{\sigma}]-\sigma_x=\mathrm{O}[\frac{1}{n}]$, where $\sigma_x^2$ is the variance of $x$. Hence the estimator is consistent, but not (absolutely) "unbiased", as $\mathrm{O}[\frac{1}{n}]$ can be arbitrarily large for small $n$.


Note: Below is my original "answer". From here on, the comments are about the standard "sample" mean and variance, which are "distribution-free" unbiased estimators (i.e. the population is not assumed to be Gaussian).

This is not a complete answer, but rather a clarification on why the sample variance formula is commonly used.

Given a random sample $\{x_1,\ldots,x_n\}$, so long as the variables have a common mean, the estimator $\bar{x}=\frac{1}{n}\sum_ix_i$ will be unbiased, i.e. $$\mathbb{E}[x_i]=\mu \implies \mathbb{E}[\bar{x}]=\mu$$

If the variables also have a common finite variance, and they are uncorrelated, then the estimator $s^2=\frac{1}{n-1}\sum_i(x_i-\bar{x})^2$ will also be unbiased, i.e. $$\mathbb{E}[x_ix_j]-\mu^2=\begin{cases}\sigma^2&i=j\\0&i\neq{j}\end{cases} \implies \mathbb{E}[s^2]=\sigma^2$$ Note that the unbiasedness of these estimators depends only on the above assumptions (and the linearity of expectation; the proof is just algebra). The result does not depend on any particular distribution, such as Gaussian. The variables $x_i$ do not have to have a common distribution, and they do not even have to be independent (i.e. the sample does not have to be i.i.d.).
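A quick numerical check of this distribution-free claim, using decidedly non-Gaussian data (sketch; exprnd is assumed from the Statistics Toolbox / Octave statistics package):

% s^2 is unbiased even for exponential data (here sigma^2 = 1)
N=1e6; n=3;
x=exprnd(1,n,N);           % mean 1, variance 1, heavily skewed
mean(var(x))               % ~1.0: the variance estimator is unbiased
mean(std(x))               % < 1: the SD is still biased low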

The "sample standard deviation" $s$ is not an unbiased estimator, $\mathbb{s}\neq\sigma$, but nonetheless it is commonly used. My guess is that this is simply because it is the square root of the unbiased sample variance. (With no more sophisticated justification.)

In the case of an i.i.d. Gaussian sample, the maximum likelihood estimates (MLE) of the parameters are $\hat{\mu}_\mathrm{MLE}=\bar{x}$ and $(\hat{\sigma}^2)_\mathrm{MLE}=\frac{n-1}{n}s^2$, i.e. the variance divides by $n$ rather than $n-1$. Moreover, in the i.i.d. Gaussian case the standard deviation MLE is just the square root of the MLE variance. However these formulas, as well as the one hinted at in your question, depend on the Gaussian i.i.d. assumption.


Update: Additional clarification on "biased" vs. "unbiased".

Consider an $n$-element sample as above, $X=\{x_1,\ldots,x_n\}$, with sum-square-deviation $$\delta^2_n=\sum_i(x_i-\bar{x})^2$$ Given the assumptions outlined in the first part above, we necessarily have $$\mathbb{E}[\delta^2_n]=(n-1)\sigma^2$$ so the (Gaussian-)MLE estimator is biased $$\widehat{\sigma^2_n}=\tfrac{1}{n}\delta^2_n \implies \mathbb{E}[\widehat{\sigma^2_n}]=\tfrac{n-1}{n}\sigma^2 $$ while the "sample variance" estimator is unbiased $$s^2_n=\tfrac{1}{n-1}\delta^2_n \implies \mathbb{E}[s^2_n]=\sigma^2$$

Now it is true that $\widehat{\sigma^2_n}$ becomes less biased as the sample size $n$ increases. However $s^2_n$ has zero bias no matter the sample size (so long as $n>1$). For both estimators, the variance of their sampling distribution will be non-zero, and depend on $n$.

As an example, the below Matlab code considers an experiment with $n=2$ samples from a standard-normal population $z$. To estimate the sampling distributions for $\bar{x},\widehat{\sigma^2},s^2$, the experiment is repeated $N=10^6$ times. (You can cut & paste the code here to try it out yourself.)

% n=sample size, N=number of samples
n=2; N=1e6;
% generate standard-normal random #'s
z=randn(n,N); % i.e. mu=0, sigma=1
% compute sample stats (Gaussian MLE)
zbar=sum(z)/n; zvar_mle=sum((z-zbar).^2)/n;
% compute ensemble stats (sampling-pdf means)
zbar_avg=sum(zbar)/N, zvar_mle_avg=sum(zvar_mle)/N
% compute unbiased variance
zvar_avg=zvar_mle_avg*n/(n-1)

Typical output is like

zbar_avg     =  1.4442e-04
zvar_mle_avg =  0.49988
zvar_avg     =  0.99977

confirming that \begin{align} \mathbb{E}[\bar{z}]&\approx\overline{(\bar{z})}\approx\mu=0 \\ \mathbb{E}[s^2]&\approx\overline{(s^2)}\approx\sigma^2=1 \\ \mathbb{E}[\widehat{\sigma^2}]&\approx\overline{(\widehat{\sigma^2})}\approx\frac{n-1}{n}\sigma^2=\frac{1}{2} \end{align}


Update 2: Note on fundamentally "algebraic" nature of unbiased-ness.

In the above numerical demonstration, the code approximates the true expectation $\mathbb{E}[\,]$ using an ensemble average with $N=10^6$ replications of the experiment (i.e. each is a sample of size $n=2$). Even with this large number, the typical results quoted above are far from exact.

To numerically demonstrate that the estimators are really unbiased, we can use a simple trick to approximate the $N\to\infty$ case: simply add the following line to the code

% optional: "whiten" data (ensure exact ensemble stats)
[U,S,V]=svd(z-mean(z,2),'econ'); z=sqrt(N)*U*V';

(placing after "generate standard-normal random #'s" and before "compute sample stats")

With this simple change, even running the code with $N=10$ gives results like

zbar_avg     =  1.1102e-17
zvar_mle_avg =  0.50000
zvar_avg     =  1.00000
GeoMatt22
  • Unbiased-ness has to do with the "theoretical expected value" ($\mathbb{E}[\,]$) being correct. For both $\bar{x}$ and $s^2$, the estimators are unbiased for any $n>1$ (or $\geq$, in the case of $\bar{x}$). On the other hand, the **variance** of these estimators *strongly* depends on sample size $n$. Think of it this way: Say we take $N$ samples, each of size $n$. This gives, e.g. $\bar{x}_1,\ldots,\bar{x}_N$ and $s^2_1,\ldots,s^2_N$. "Unbiased" means as $N\to\infty$ the "meta-sample" mean taken over $\bar{x}_i$ and $s^2_i$ will converge ... independent of $n$. So "big $N$" limit, yes. – GeoMatt22 Dec 08 '16 at 06:13
  • @Carl FYI: I vote your comment with "flak" for moderator attention as inappropriate. And +1 to GeoMatt22, sample variance is indeed unbiased. – amoeba Dec 08 '16 at 07:08
  • @amoeba Well, I'll eat my hat. I squared the SD-values in each line then averaged them and they come out unbiased (0.9994), whereas the SD-values themselves do not. Meaning that you and GeoMatt22 are correct, and I am wrong. – Carl Dec 08 '16 at 07:27
  • @GeoMatt22 It seems that the variance is unbiased but its square root is not. So one method of producing a less biased SD for a series of SD values is merely to square them, average the variances, and then take the square root. In other words, the order of operations is important. – Carl Dec 08 '16 at 07:51
  • @amoeba Not understanding you. The variance of an $n=2$ calculation may be unbiased but its square root is heavily biased and needs small number correction, go figure. Not a result that I anticipated, but an important factoid nonetheless. That factoid may have some far reaching implications, and that is why I am bothering with this. – Carl Dec 08 '16 at 08:07
  • Yes, that's all correct. Ah, I think on re-reading I understood what you meant in that sentence. Erasing my previous comment. @Carl – amoeba Dec 08 '16 at 08:25
  • @Carl: It's generally true that transforming an unbiased estimator of a parameter doesn't give an unbiased estimate of the transformed parameter except when the transformation is affine, following from the linearity of expectation. So on what scale is unbiasedness important to you? – Scortchi - Reinstate Monica Dec 08 '16 at 08:29
  • @Scortchi The scale of importance is context dependent, so there is no unique answer that I can offer. Much of what I do is within between 1% to 4% 1 SD total propagated error, but only because I do a lot of optimization. Suboptimal methodology mortifies me. Perhaps the addendum I added to the posted question helps? – Carl Dec 08 '16 at 09:06
  • @GeoMatt2 The question is about standard deviation which is a biased transform of admittedly unbiased variance. As Scortchi pointed out, a non-affine transform, e.g., square rooting, of an unbiased measure is unlikely to be unbiased in general and standard deviation, as used for t-testing, standard error of the mean and myriad other uses underestimates and is biased. Any formula for standard deviation, e.g. with a $n-\pi$ divisor will be unbiased in the limit as $n$ goes to infinity. That is not the criterion for unbiasedness when $n$ is a small integer. Please answer questions asked. – Carl Dec 09 '16 at 06:09
  • Carl: I apologize if you feel my answer was orthogonal to your question. It was intended to provide a plausible explanation of Q:"why a biased standard deviation formula is typically used?" A:"simply because the associated variance estimator is unbiased, vs. any real *mathematical/statistical* justification". As for your comment, typically "unbiased" describes an estimator whose expected value is correct *independent* of sample size. If it is unbiased only in the limit of infinite sample size, typically it would be called "[consistent](https://en.wikipedia.org/wiki/Consistent_estimator)". – GeoMatt22 Dec 09 '16 at 06:38
  • @GeoMatt22 You say that "sample variance $s^2$ is unbiased for any distribution with finite variance $σ^2$". And... "it is important to note that unlike the variance, for the standard deviation it is not possible to have a "distribution free" unbiased estimator". However, [this](https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation#Other_distributions) lists an estimator for unbiased other distributions. Moreover, one should compute variance for a normal-squared distribution from its [transformed normal distribution](http://stats.stackexchange.com/a/249882/99274) – Carl Dec 10 '16 at 02:38
  • Carl, for your first link, see the section **Clarification on "distribution-free unbiased estimator"** added to my answer just now (too long for a comment). [BTW you should definitely keep on synthesizing things into your answer, as I agree mine is somewhat "tangential/background".] – GeoMatt22 Dec 10 '16 at 03:55
  • (+1) Nice answer. Small caveat: That Wikipedia passage on consistency quoted in this answer is a bit of a mess and the parenthetical statement made related to it is potentially misleading. "Consistency" and "asymptotic unbiasedness" are in some sense orthogonal properties of an estimator. For a little more on that point, see the comment thread to [this answer](http://stats.stackexchange.com/a/31038/2970). – cardinal Dec 10 '16 at 21:45
  • +1 but I think @Scortchi makes a really important point in his answer that is not mentioned in yours: namely, that even for Gaussian population, the unbiased estimate of $\sigma$ has higher expected error than the standard biased estimate of $\sigma$ (due to the high variance of the former). This is a strong argument in favour of *not* using an unbiased estimator even if one *knows* that the underlying distribution is Gaussian. – amoeba Dec 13 '16 at 14:52
  • @amoeba: Well, I don't want to make a fetish of MSE either (on the scale of $\sigma$ or any other scale), or I'd've suggested adding more bias & using the estimator $c_1S$ ;) – Scortchi - Reinstate Monica Dec 13 '16 at 15:58
  • @amoeba really I think Scortchi's answer is by far the best. If it is not to be accepted, we should all at least upvote it enough to get a [populist](http://stats.stackexchange.com/help/badges/39/populist) badge! My "answer" is more a running diary of the meandering discussion. Scortchi's actually addresses the OP question in a well focused manner. – GeoMatt22 Dec 13 '16 at 18:59
  • I will award it a bounty if @Carl does not. – amoeba Dec 13 '16 at 19:40
  • I intended my answer merely as a supplement to this one & @civilstat's - convenience & convention aren't necessarily trumping "optimality" when $S$ is used as an estimator of $\sigma$ because a precise notion of optimality would have to be tailored to the particular, practical requirements of each analysis. – Scortchi - Reinstate Monica Dec 14 '16 at 13:46
15

The sample standard deviation $S=\sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$ is complete and sufficient for $\sigma$ so the set of unbiased estimators of $\sigma^k$ given by

$$ \frac{(n-1)^\frac{k}{2}}{2^\frac{k}{2}} \cdot \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n+k-1}{2}\right)} \cdot S^k = \frac{S^k}{c_k} $$

(See Why is sample standard deviation a biased estimator of $\sigma$?) are, by the Lehmann–Scheffé theorem, UMVUE. Consistent, though biased, estimators of $\sigma^k$ can also be formed as

$$ \tilde{\sigma}^k_j= \left(\frac{S^j}{c_j}\right)^\frac{k}{j} $$

(the unbiased estimators being specified when $j=k$). The bias of each is given by

$$\operatorname{E}\tilde{\sigma}^k_j - \sigma^k =\left( \frac{c_k}{c_j^\frac{k}{j}} -1 \right) \sigma^k$$

& its variance by

$$\operatorname{Var}\tilde{\sigma}^{k}_j=\operatorname{E}\tilde{\sigma}^{2k}_j - \left(\operatorname{E}\tilde{\sigma}^k_j\right)^2=\frac{c_{2k}-c_k^2}{c_j^\frac{2k}{j}} \sigma^{2k}$$

For the two estimators of $\sigma$ you've considered, $\tilde{\sigma}^1_1=\frac{S}{c_1}$ & $\tilde{\sigma}^1_2=S$, the lack of bias of $\tilde{\sigma}_1$ is more than offset by its larger variance when compared to $\tilde{\sigma}_2$:

$$\begin{align} \operatorname{E}\tilde{\sigma}_1 - \sigma &= 0 \\ \operatorname{E}\tilde{\sigma}_2 - \sigma &=(c_1 -1) \sigma \\ \operatorname{Var}\tilde{\sigma}_1 =\operatorname{E}\tilde{\sigma}^{2}_1 - \left(\operatorname{E}\tilde{\sigma}^1_1\right)^2 &=\frac{c_{2}-c_1^2}{c_1^2} \sigma^{2} = \left(\frac{1}{c_1^2}-1\right) \sigma^2 \\ \operatorname{Var}\tilde{\sigma}_2 =\operatorname{E}\tilde{\sigma}^{2}_1 - \left(\operatorname{E}\tilde{\sigma}_2\right)^2 &=\frac{c_{2}-c_1^2}{c_2} \sigma^{2}=(1-c_1^2)\sigma^2 \end{align}$$ (Note that $c_2=1$, as $S^2$ is already an unbiased estimator of $\sigma^2$.)
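For readers who want numbers, a sketch (my code, Octave/Matlab, using gammaln for stability) evaluating $c_k$ and the two variance expressions above at $n=2$:

% c_k = (2/(n-1))^(k/2) * gamma((n+k-1)/2) / gamma((n-1)/2)
n=2;
ck=@(k) (2/(n-1))^(k/2)*exp(gammaln((n+k-1)/2)-gammaln((n-1)/2));
c1=ck(1)                   % = sqrt(2/pi) ~ 0.798; ck(2) = 1 as noted
[1/c1^2-1, 1-c1^2]         % Var(S/c1), Var(S), in units of sigma^2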

[Plot showing contributions of bias & variance to MSE at sample sizes from one to 20 for the two estimators]

The mean square error of $a_k S^k$ as an estimator of $\sigma^k$ is given by

$$ \begin{align} (\operatorname{E} a_k S^k - \sigma^k)^2 + \operatorname{E} (a_k S^k)^2 - (\operatorname{E} a_k S^k)^2 &= [ (a_k c_k -1)^2 + a_k^2 c_{2k} - a_k^2 c_k^2 ] \sigma^{2k}\\ &= ( a_k^2 c_{2k} -2 a_k c_k + 1 ) \sigma^{2k} \end{align} $$

& therefore minimized when

$$a_k = \frac{c_k}{c_{2k}}$$

, allowing the definition of another set of estimators of potential interest:

$$ \hat{\sigma}^k_j= \left(\frac{c_j S^j}{c_{2j}}\right)^\frac{k}{j} $$

Curiously, $\hat{\sigma}^1_1=c_1S$, so the same constant that divides $S$ to remove bias multiplies $S$ to reduce MSE. Anyway, these are the uniformly minimum variance location-invariant & scale-equivariant estimators of $\sigma^k$ (you don't want your estimate to change at all if you measure in kelvins rather than degrees Celsius, & you want it to change by a factor of $\left(\frac{9}{5}\right)^k$ if you measure in Fahrenheit).

None of the above has any bearing on the construction of hypothesis tests or confidence intervals (see e.g. Why does this excerpt say that unbiased estimation of standard deviation usually isn't relevant?). And $\tilde{\sigma}^k_j$ & $\hat{\sigma}^k_j$ exhaust neither estimators nor parameter scales of potential interest: consider the maximum-likelihood estimator† $\sqrt{\frac{n-1}{n}}S$, or the median-unbiased estimator $\sqrt{\frac{n-1}{\chi^2_{n-1}(0.5)}}S$; or the geometric standard deviation of a lognormal distribution $\mathrm{e}^\sigma$. It may be worth showing a few more-or-less popular estimates made from a small sample ($n=2$) together with the upper & lower bounds, $\sqrt{\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha)}}$ & $\sqrt{\frac{(n-1)s^2}{\chi^2_{n-1}(1-\alpha)}}$, of the equal-tailed confidence interval having coverage $1-\alpha$:

[Confidence distribution for $\sigma$ showing estimates]

The span between the most divergent estimates is negligible in comparison with the width of any confidence interval having decent coverage. (The 95% C.I., for instance, is $(0.45s,31.9s)$.) There's no sense in being finicky about the properties of a point estimator unless you're prepared to be fairly explicit about what you want to use it for: most explicitly you can define a custom loss function for a particular application. A reason you might prefer an exactly (or almost) unbiased estimator is that you're going to use it in subsequent calculations during which you don't want bias to accumulate: your illustration of averaging biased estimates of standard deviation is a simple example of such (a more complex example might be using them as a response in a linear regression). In principle an all-encompassing model should obviate the need for unbiased estimates as an intermediate step, but might be considerably more tricky to specify & fit.
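A sketch of that interval calculation (chi2inv assumed from the Statistics Toolbox / Octave statistics package; I read $\chi^2_{n-1}(\cdot)$ above as the quantile function with equal $\alpha/2$ in each tail):

% equal-tailed 95% CI for sigma from a sample of size n=2 with s=1
n=2; s2=1; alpha=0.05;
lo=sqrt((n-1)*s2/chi2inv(1-alpha/2,n-1))  % ~0.45
hi=sqrt((n-1)*s2/chi2inv(alpha/2,n-1))    % ~31.9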

† The value of $\sigma$ that makes the observed data most probable has an appeal as an estimate independent of consideration of its sampling distribution.

Scortchi - Reinstate Monica
8

Q2: Would someone please explain to me why we are using SD anyway as it is clearly biased and misleading?

This came up as an aside in comments, but I think it bears repeating because it's the crux of the answer:

The sample variance formula is unbiased, and variances are additive. So if you expect to do any (affine) transformations, this is a serious statistical reason why you should insist on a "nice" variance estimator over a "nice" SD estimator.

In an ideal world, they'd be equivalent. But that's not true in this universe. You have to choose one, so you might as well choose the one that lets you combine information down the road.

- Comparing two sample means? The variance of their difference is the sum of their variances.
- Doing a linear contrast with several terms? Get its variance by taking a linear combination of their variances.
- Looking at regression line fits? Get their variance using the variance-covariance matrix of your estimated beta coefficients.
- Using F-tests, or t-tests, or t-based confidence intervals? The F-test calls for variances directly; and the t-test is exactly equivalent to the square root of an F-test.

In each of these common scenarios, if you start with unbiased variances, you'll remain unbiased all the way (unless your final step converts to SDs for reporting).
Meanwhile, if you'd started with unbiased SDs, neither your intermediate steps nor the final outcome would be unbiased anyway.
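A quick numerical illustration of the additivity point (my sketch): the variance of a difference of two independent sample means is the sum of their variances, while the SDs do not add.

% difference of two independent sample means, each of size n
n=10; N=1e5;
d=mean(randn(n,N))-mean(randn(n,N));
var(d)                     % ~0.2 = 1/n + 1/n: the variances add
std(d)                     % ~0.45 = sqrt(0.2), not 2/sqrt(n) ~ 0.63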

civilstat
  • Variance is not a distance measurement, and standard deviation is. Yes, vector distances add by squares, but the primary measurement is distance. The question was what would you use corrected distance for, and not why should we ignore distance as if it did not exist. – Carl Dec 11 '16 at 03:39
  • Well, I guess I'm arguing that "the primary measurement is distance" isn't necessarily true. 1) Do you have a method to work with unbiased variances; combine them; take the final resulting variance; and rescale its sqrt to get an unbiased SD? Great, then do that. If not... 2) What are you going to *do* with a SD from a tiny sample? Report it on its own? Better to just plot the datapoints directly, not summarize their spread. And how will people interpret it, other than as an input to SEs and thus CIs? It's meaningful as an input to CIs, but then I'd prefer the t-based CI (with usual SD). – civilstat Dec 11 '16 at 22:35
  • I do no think that many clinical studies or commercial software programs with $n<25$ would use standard error of the mean calculated from small sample corrected standard deviation leading to a false impression of how small those errors are. I think even that one issue, even if that is the only one, should be ignored. – Carl Dec 11 '16 at 23:00
  • "so you might as well choose the one that lets you combine information down the road" and "the primary measurement is distance" isn't necessarily true. Farmer Jo's house is 640 acres down the road? One uses the appropriate measurement correctly for each and every situation, or one has a higher tolerance for false witness than I. My only question here is when to use what, and the answer to it is not "never." – Carl Dec 12 '16 at 03:11
  • Well, +1 anyway, answer is not bad. – Carl Jul 17 '20 at 06:47
1

This post is in outline form.

(1) Taking a square root is not an affine transformation (Credit @Scortchi.)

(2) ${\rm var}(s) = {\rm E} (s^2) - {\rm E}(s)^2$, thus ${\rm E}(s) = \sqrt{{\rm E}(s^2) -{\rm var}(s)}\neq{\sqrt{\rm var(s)}}$

(3) ${\rm var}(s)=\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$, whereas $\text{E}(s)\,=\,\,\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}\sqrt{\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{2}}\neq\sqrt{\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}={\sqrt{\rm var(s)}}$

(4) Thus, we cannot substitute ${\sqrt{\rm var(s)}}$ for $\text{E}(s)$, for $n$ small, as square root is not affine.

(5) ${\rm var}(s)$ and $\text{E}(s)$ are unbiased (Credit @GeoMatt22 and @Macro, respectively).

(6) For non-normal distributions $\bar{x}$ is sometimes (a) undefined (e.g., Cauchy, Pareto with small $\alpha$) and (b) not UMVUE (e.g., Cauchy ($\rightarrow$ Student's-$t$ with $df=1$), Pareto, uniform, beta). Even more commonly, the variance may be undefined, e.g., Student's-$t$ with $1\leq df\leq2$. Then one can state that $\text{var}(s)$ is not UMVUE for distributions in general. Thus, there is no special onus against introducing an approximate small-number correction for standard deviation, which likely has limitations similar to those of $\sqrt{\text{var}(s)}$ but is additionally less biased: $\hat\sigma = \sqrt{ \frac{1}{n - 1.5 - \tfrac14 \gamma_2} \sum_{i=1}^n (x_i - \bar{x})^2 }$ ,

where $\gamma_2$ is excess kurtosis. In a similar vein, when examining a normal squared distribution (a Chi-squared with $df=1$ transform), we might be tempted to take its square root and use the resulting normal distribution properties. That is, in general, the normal distribution can result from transformations of other distributions and it may be expedient to examine the properties of that normal distribution such that the limitation of small number correction to the normal case is not so severe a restriction as one might at first assume.
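A hedged sketch of that approximate estimator (kurtosis is assumed from the Statistics Toolbox and returns the non-excess kurtosis, hence the -3; in practice $\gamma_2$ should come from the assumed population rather than be estimated from a tiny sample):

% approximately unbiased SD for a (possibly non-normal) sample x
x=randn(20,1);             % example sample
n=numel(x);
g2=kurtosis(x)-3;          % excess kurtosis
sigma_hat=sqrt(sum((x-mean(x)).^2)/(n-1.5-g2/4))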

For the normal distribution case:

A1: By Lehmann-Scheffe theorem ${\rm var}(s)$ and $\text{E}(s)$ are UMVUE (Credit @Scortchi).

A2: (Edited to adjust for comments below.) For $n\leq 25$, we should use $\text{E}(s)$ for standard deviation, standard error, confidence intervals of the mean and of the distribution, and optionally for z-statistics. For $t$-testing we would not use the unbiased estimator, as $\frac{ \bar X - \mu} {\sqrt{\text{var}(s)/n}}$ is itself Student's-$t$ distributed with $n-1$ degrees of freedom (Credit @whuber and @GeoMatt22). For z-statistics, $\sigma$ is usually approximated using large $n$, for which $\text{E}(s)-\sqrt{\text{var}(s)}$ is small, but for which $\text{E}(s)$ appears to be more mathematically appropriate (Credit @whuber and @GeoMatt22).

Carl
  • **A2 is incorrect:** following that prescription would produce demonstrably invalid tests. As I commented to the question, perhaps too subtly: consult any theoretical account of a classical test, such as the t-test, to see why a bias correction is irrelevant. – whuber Dec 09 '16 at 21:24
  • @whuber I will take your word for this for now, as you are rarely incorrect. However, I will investigate further, if for no other reason that I simply do not understand how what you are saying can possibly be correct. – Carl Dec 09 '16 at 21:28
  • There's a strong meta-argument showing why bias correction for statistical tests is a red herring: if it were incorrect not to include a bias-correction factor, *then that factor would already be included in standard tables* of the Student t distribution, F distribution, etc. To put it another way: if I'm wrong about this, then everybody has been wrong about statistical testing for the last century. – whuber Dec 09 '16 at 21:30
  • @whuber I understand that meta-argument. However, without doing a simulation to confirm that, I just do not believe it. My meta-thought is that the standard tables use actual population values, not sample values. – Carl Dec 09 '16 at 21:33
  • See [Why does this excerpt say that unbiased estimation of standard deviation usually isn't relevant?](http://stats.stackexchange.com/q/33235/17230). It can be helpful when writing the formula for the t-statistic to recall the occurrence of an estimator of the population standard deviation in the denominator, but that's all there is to it. – Scortchi - Reinstate Monica Dec 09 '16 at 21:46
  • Am I the only one who's baffled by the notation here? Why use $\operatorname{E}(s)$ to stand for $\frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)}\sqrt{\frac{\Sigma_{i=1}^{n}(x_i-\bar{x})^2}{2}}$, the unbiased estimate of standard deviation? What's $s$? – Scortchi - Reinstate Monica Dec 09 '16 at 21:58
  • @Scortchi the notation apparently came about as an attempt to inherit that used in the [linked post](http://stats.stackexchange.com/a/27984/99274). There $s$ is the sample variance, and $E(s)$ is the expected value of $s$ for a Gaussian sample. In this question, "$E(s)$" was co-opted to be a new estimator derived from the original post (i.e. something like $\hat{\sigma}\equiv s/\alpha$ where $\alpha\equiv\mathbb{E}[s]/\sigma$). If we arrive at a satisfactory answer for this question, probably a cleanup of the question & answer notation would be warranted :) – GeoMatt22 Dec 09 '16 at 22:20
  • $s$ is typical notation for standard deviation. I suppose one should use $\hat{\sigma}$ or some such somewhere. Go ahead and edit the texts if you wish, I am a Newbie for statistical notation, if not for the meaning of numbers. – Carl Dec 09 '16 at 22:21
  • Carl: I added a note in my answer with my previous comment, and also trying to get at whuber's point. If you check the Wikipedia links, they should give references for more details. The basic point is that the standard statistical tests **already account** for the bias. – GeoMatt22 Dec 09 '16 at 22:22
  • @GeoMatt22 That is exactly what whuber said, I got that, I just do not fathom how that could possibly be true. There is the problem of substituting an inaccurate underestimate of population standard deviation into a formula that was calibrated for true population standard deviations. – Carl Dec 09 '16 at 22:26
  • Here's your simulation. It outputs a histogram of t-statistics, with or without a bias correction for the denominator, and superimposes the Student t distribution. The correction will be valid if the histogram and theoretical curve agree. `n – whuber Dec 09 '16 at 22:40
  • @whuber I use Mathematica, although I have a copy of R on my computer, but I get the point. – Carl Dec 09 '16 at 22:48
  • @GeoMatt22 $\frac{ \bar X - \mu} {\sigma/\sqrt n}\leftarrow \frac{ \bar X - \mu} {S/\sqrt n}$, iff this has a Student's-$t$ distribution with $n-1$ degrees of freedom? – Carl Dec 09 '16 at 22:48
  • @whuber OK, I buy your argument for the t-test, what about the z-test? – Carl Dec 09 '16 at 22:55
  • The z-test assumes the denominator is an accurate estimate of $\sigma$. It's known to be an approximation that is only asymptotically correct. If you want to correct it, *don't* use the bias of the SD estimator--just use a t-test. That's what the t-test was invented for. – whuber Dec 09 '16 at 22:58
  • @whuber I can recall a situation in which only the z-test was applicable. If what you are saying is never use the z-test, I think (not sure) that makes certain problems intractable. – Carl Dec 09 '16 at 23:03
  • I'm not saying never to use a z-test. Just because something is an approximation doesn't mean it's to be eschewed. It would be superfluous to discuss the issue, though, because it has been discussed so extensively for the past 100 years: most beginning stats texts will provide useful advice. – whuber Dec 09 '16 at 23:05
  • Carl: For a $z$ test, say you are estimating $\mu\approx\bar{x}$ with sample size $n$, but the error $\sigma$ is dominated by the measurement instrument, which was calibrated previously to estimate $\sigma\approx s_m$, but where $m$ is **much** larger than $n$. – GeoMatt22 Dec 09 '16 at 23:59
  • @GeoMatt22 When calibrating $\sigma$, for the first time, which is not atypical, it would therefore be better to use the unbiased calculation, unless the intended use is for $t$-testing. I still am having a problem with self-consistency for $t$-statistic standard deviation as a standard deviation is an absolute distance measurement, to have that change between a normal distribution and a Student's-t distribution is interesting. – Carl Dec 10 '16 at 00:50
  • Yes. I was referring to a case where you may use a measurement device (e.g. scale, thermometer), and prior to you obtaining the device, the device manufacturer calibrated it against a [standard](https://en.wikipedia.org/wiki/Standard_(metrology)) with e.g. $m=10^3$, and reported $s_m\approx\sigma$. Then say you compare $n=2$ weights of a subject and want to say if any difference is due to measurement error. Then "$\sigma$" could be taken to be "effectively known". (Just as a "for instance" where a $z$-test could be perhaps used.) – GeoMatt22 Dec 10 '16 at 01:37
  • @GeoMatt22 I have only ever been the originator of my own calibrations for novel methods (journal articles), thus my interest may seem like nit-picking, but comes with the territory. Not trying to be difficult, just groping for answers. – Carl Dec 10 '16 at 01:53
  • Carl, no problem. I believe all the "classical significance tests" have no problems ... so long as the population is Gaussian. So the more likely failure mode is that for "small $n$" the central limit theorem has not kicked in, so the Gaussian assumption is more easily violated (e.g. $\bar{x}_n$ is *asymptotically* normal for non-Gaussian $x$, but if $n=2$ this may not be helpful!) – GeoMatt22 Dec 10 '16 at 02:21
  • The basic point here is that one should not average standard deviations, they add by root mean square. And, if one only has two data points, the variance will be OK, but the standard deviation will be underestimated, and markedly at that. – Carl Apr 07 '18 at 08:07
0

I want to add the Bayesian answer to this discussion. Just because your assumption is that the data is generated according to some normal with unknown mean and variance, that doesn't mean that you should summarize your data using a mean and a variance. This whole problem can be avoided if you draw the model, which will have a posterior predictive that is a three parameter noncentral scaled student's T distribution. The three parameters are the total of the samples, total of the squared samples, and the number of samples. (Or any bijective map of these.)

Incidentally, I like civilstat's answer because it highlights our desire to combine information. The three sufficient statistics above are even better than the two given in the question (or by civilstat's answer). Two sets of these statistics can easily be combined, and they give the best posterior predictive given the assumption of normality.
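As a sketch of what that looks like under the standard noninformative (Jeffreys) prior $p(\mu,\sigma^2)\propto1/\sigma^2$ (my assumption; the answer does not fix a prior), the posterior predictive for a new observation is a shifted, scaled Student-t built from the three sufficient statistics:

% sufficient statistics: n, T1 = sum(x), T2 = sum(x.^2)
x=randn(10,1);             % example sample
n=numel(x); T1=sum(x); T2=sum(x.^2);
xbar=T1/n;
s2=(T2-n*xbar^2)/(n-1);
% posterior predictive: xbar + sqrt(s2*(1+1/n)) * (Student-t, n-1 dof)
scale=sqrt(s2*(1+1/n)); dof=n-1;

Combining two datasets is then just a matter of adding their $(n, T_1, T_2)$ triples before computing these quantities.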

Neil G
  • How then does one calculate an unbiased standard error of the mean from those three sufficient statistics? – Carl Dec 14 '16 at 17:44
  • @carl You can easily calculate it since you have the number of samples $n$, you can multiply the uncorrected sample variance by $\frac{n}{n-1}$. However, you really don't want to do that. That's tantamount to turning your three parameters into a best fit normal distribution to your limited data. It's a lot better to use your three parameters to fit the true posterior predictive: the noncentral scaled T distribution. All questions you might have (percentiles, etc.) are better answered by this T distribution. In fact, T tests are just common sense questions asked of this distribution. – Neil G Dec 15 '16 at 00:30
  • How can one then generate a true normal distribution RV from Monte Carlo simulations(s) and recover that true distribution using only Student's-$t$ distribution parameters? Am I missing something here? – Carl Dec 15 '16 at 02:57
  • @Carl The sufficient statistics I described were the mean, second moment, and number of samples. Your MLE of the original normal are the mean and variance (which is equal to the second moment minus the squared mean). The number of samples is useful when you want to make predictions about future observations (for which you need the posterior predictive distribution). – Neil G Dec 15 '16 at 03:24
  • Though a Bayesian perspective is a welcome addition, I find this a little hard to follow: I'd have expected a discussion of constructing a point estimate from the posterior density of $\sigma$. It seems you're rather questioning the need for a point estimate: this is something well worth bringing up, but not uniquely Bayesian. (BTW you also need to explain the priors.) – Scortchi - Reinstate Monica Dec 16 '16 at 14:22