
In this video...

https://www.youtube.com/watch?v=sHRBg6BhKjI

...and in many others, the explanation for why we divide by $n-1$ instead of by $n$ when calculating the sample variance is the following:

For any value of the true mean $\mu_x$, the sum of the squared differences of the data points in the sample from $\mu_x$ will always be at least as large as the sum of the squared differences of the same data points from the sample mean.

That's because the sum $\sum{(x_i-c)^2}$ is minimized when $c$ is the sample mean $\bar{X}$, rather than some other number (such as the population mean $\mu_x$).

Had we used the population mean $\mu_x$, we would've gotten a larger sum of squared distances. Since we can't possibly know the population mean, we use the sample mean and divide by $n-1$ to make the sample variance a little bigger.
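
For example, here's a quick check of that minimization fact with some made-up numbers (nothing special about them, and `sum_sq` is just a throwaway name):

```python
import numpy as np

x = np.array([2.0, 4.0, 9.0])    # arbitrary toy sample; its mean is 5.0

def sum_sq(c):
    """Sum of squared distances of the sample from an arbitrary center c."""
    return np.sum((x - c) ** 2)

print(sum_sq(x.mean()))          # 26.0 -- the smallest achievable value
print(sum_sq(4.0), sum_sq(6.0))  # 29.0 29.0 -- any other center gives more
```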

But, although I understand the above, I still don't see why it implies that the sample variance will better approximate the true population variance.

Although the sum of the squared distances of the data points in the sample from the sample mean will always be less than the sum of their squared distances from the population mean, why does that imply that it (divided by $n$) will tend to be less than the variance of the actual population, which is computed over the whole population rather than just a sample?

joshuaronis
  • Does this help? https://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation/3934#3934 – Michael Lew Nov 16 '19 at 06:07

2 Answers


A somewhat intuitive argument (though one that can be made rigorous):

The population variance is itself a population average. Specifically, if you define a new variable to be the square of the difference of the original variable from its population mean, $Y=(X-\mu_X)^2$ (NB when using capital letters I am referring to random variables, rather than their realizations), then the expected value (population mean) of the new variable is the variance of the original one.

The $n$-denominator variance about the population mean, $\frac{1}{n}\sum_i (X_i-\mu_X)^2$, is just the corresponding sample average for observations from the distribution of $Y$. Sample averages are unbiased estimators of their population counterparts - that is, the expected value of a sample average IS the population mean.

So if we were able to calculate the average of a sample filled with $(X_i-\mu_X)^2$ values, it would be an unbiased estimator of the variance of the $X$-distribution (that is, correct on average, over many such samples).

Let $\bar{Y}$ be the mean of a sample taken from the $Y$ distribution.

Computationally speaking, what we're saying is that $E[\bar{Y}] = \mu_Y = Var(X)$.

Written out in full, $E[\sum{(X_i-\mu_X)^2}/n] = Var(X)$. The expected value of the average of a sample from the $Y$ distribution is the variance of $X$. This is what it means for the sample average from the $Y$ distribution to be an "unbiased" estimator of $Var(X)$.

If we were to replace $\mu_X$ with the average of the $X_i$ sample, $\bar{X}$, that sum would always become smaller, and thus the overall expected value would become smaller.

Since it's smaller than something that is unbiased (the only exception is when the variance is 0), it is therefore biased (specifically, biased downward, too small on average).
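
If it helps, here's a small simulation sketch of that argument (the normal distribution, sample size, and seed are arbitrary choices for illustration): averaging $(X_i-\mu_X)^2$ over many samples recovers $Var(X)$, while averaging $(X_i-\bar X)^2$ comes out systematically smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0              # illustrative population mean and SD; Var(X) = 4
n, n_sims = 10, 200_000           # sample size and number of simulated samples

x = rng.normal(mu, sigma, size=(n_sims, n))
xbar = x.mean(axis=1, keepdims=True)

# n-denominator average of squared deviations from the TRUE mean (the Y-bar above)
about_mu = ((x - mu) ** 2).mean(axis=1)

# n-denominator average of squared deviations from the SAMPLE mean
about_xbar = ((x - xbar) ** 2).mean(axis=1)

print(about_mu.mean())    # close to sigma^2 = 4       (unbiased)
print(about_xbar.mean())  # close to 4 * (n-1)/n = 3.6 (too small on average)
```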

Glen_b
  • I love your answer - it's intuitive and concise. Thank you! Two things: first of all, I made a little edit for myself because I was confused between the third and fourth section. Secondly, could you finish by explaining why dividing by $n-1$ fixes the bias? Thanks again! – joshuaronis Nov 16 '19 at 21:40
  • For that I think you really need mathematics and you already have that in the other answer (and indeed multiple demonstrations of it are onsite already -- that part is certainly a duplicate of previous questions). I don't see a good way to establish that via intuition (either it's overly hand-wavy or it ends up repeating the mathematics in words). I'll have to check your proposed edit more carefully later, it's pretty extensive and some of it is not what I'd say. – Glen_b Nov 17 '19 at 01:41

Let $X_1, X_2, \cdots, X_n$ be iid with mean $\mu$ and variance $\sigma^2$. Let's look at the class of estimators $$S^2_j = \frac{1}{n-j}\sum_{i=1}^n(X_i- \bar X)^2$$

Using this notation, $S_1^2$ is the usual sample variance and $S_0^2$ is the variant where we divide by the sample size $n$.
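
In code, the whole class is a one-liner (a quick sketch; `s_squared` is just an illustrative name and the data below are made up):

```python
import numpy as np

def s_squared(x, j):
    """S_j^2 = sum((x_i - xbar)^2) / (n - j)."""
    x = np.asarray(x, dtype=float)
    return np.sum((x - x.mean()) ** 2) / (x.size - j)

x = [2.1, 3.4, 1.9, 4.2, 2.8]    # toy data, purely for illustration
print(s_squared(x, 1))           # usual sample variance (divide by n - 1)
print(s_squared(x, 0))           # divide-by-n variant
```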


The sample variance is unbiased for $\sigma^2$

The derivation of this fact is fairly straightforward. Let's start by finding the expected value of $S_j^2$ for all $j$.

\begin{align} E(S_j^2) &= \frac{1}{n-j}E\left(\sum_{i=1}^n(X_i- \bar X)^2 \right) \\ &= \frac{1}{n-j}E\left(\sum_{i=1}^nX_i^2 - n\bar X^2\right) && \text{"short-cut formula"} \\ &= \frac{1}{n-j}\left(\sum_{i=1}^nE(X_i^2) - nE(\bar X^2)\right) \\ &= \frac{1}{n-j}\left(\sum_{i=1}^n\left(Var(X_i) + E(X_i)^2\right) - n\left(Var(\bar X) + E(\bar X)^2\right)\right) \\ &= \frac{1}{n-j}\left(n(\sigma^2 + \mu^2) - n(\sigma^2/n + \mu^2)\right)\\[1.2ex] &= \frac{n-1}{n-j}\sigma^2. \end{align}

The bias for this class of estimators is therefore $$B(S_j^2) = E(S_j^2) - \sigma^2 = \frac{j-1}{n-j}\sigma^2$$ which is clearly equal to $0$ if (and only if) $j=1$.
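
A quick simulation sketch of this bias formula (normal data with arbitrary illustrative values of $\mu$, $\sigma$, and $n$; any distribution with variance $\sigma^2$ would do):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.0, 3.0, 10      # illustrative values; sigma^2 = 9
n_sims = 200_000

x = rng.normal(mu, sigma, size=(n_sims, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for j in (-1, 0, 1, 2):
    empirical_bias = (ss / (n - j)).mean() - sigma**2
    theoretical_bias = (j - 1) / (n - j) * sigma**2
    print(j, round(empirical_bias, 3), round(theoretical_bias, 3))
```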


MSE under normality

Mean squared error is a popular criterion for evaluating estimators which considers the bias-variance tradeoff. Let's consider the case where $X_1, \cdots X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2)$. Under normality, we can show that

$$\frac{(n-j)S_j^2}{\sigma^2} \sim \chi^2(n-1).$$

The expected value (and hence the bias) is the same as before. The chi-square result provides an easy way of calculating the variance for this class of estimators.

Since the variance of a $\chi^2(v)$ RV is $2v$, we have that

$$\text{Var}\left(\frac{(n-j)S_j^2}{\sigma^2}\right) = 2(n-1).$$

We also have that $$\text{Var}\left(\frac{(n-j)S_j^2}{\sigma^2}\right) = \frac{(n-j)^2}{\sigma^4}\text{Var}(S_j^2).$$

Putting these pieces together implies that $$Var(S_j^2) = \frac{2\sigma^4(n-1)}{(n-j)^2}.$$
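
This is also easy to check by simulation (a sketch with $\sigma = 1$, $n = 30$, and $j = 0$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, n_sims, j = 1.0, 30, 400_000, 0     # illustrative choices

x = rng.normal(0.0, sigma, size=(n_sims, n))
s2_j = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / (n - j)

print(s2_j.var())                             # empirical Var(S_j^2)
print(2 * sigma**4 * (n - 1) / (n - j)**2)    # 2(n-1)sigma^4 / (n-j)^2 ~ 0.0644
```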

Therefore the MSE of $S_j^2$ is

$$MSE(S_j^2) = B(S_j^2)^2 + Var(S_j^2) = \sigma^4\left(\frac{2(n-1) + (j-1)^2}{(n-j)^2} \right)$$

Here is a plot of the MSE as a function of $j$ for $\sigma = 1$ and $n=30$:

[Plot: $MSE(S_j^2)$ versus $j$, with the minimum near $j = -1$.]

According to MSE, the method of moments (divide by $n$) estimator $S_0^2$ is preferable to the sample variance $S_1^2$. The truly surprising result here is that the "optimal" estimator according to MSE is $$S_{-1}^2 = \frac{1}{n+1}\sum_{i=1}^n(X_i- \bar X)^2.$$
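
You can reproduce the numbers behind the plot directly from the MSE formula (same $\sigma = 1$ and $n = 30$ as above):

```python
import numpy as np

sigma, n = 1.0, 30
j = np.arange(-5, 6)                   # a grid of j values around the optimum

mse = sigma**4 * (2 * (n - 1) + (j - 1)**2) / (n - j)**2
for jj, m in zip(j, mse):
    print(jj, round(float(m), 5))

print("argmin:", j[np.argmin(mse)])    # -1, i.e. divide by n + 1
```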

Despite this result, I've never seen anybody use this as an estimator in practice. The reason is that the MSE-optimal estimator accepts some bias in exchange for a reduction in variance: by artificially shrinking the estimator towards zero, we get an improvement in MSE (this is an example of Stein's Paradox).


So is the sample variance a better estimator? It depends on your criteria and your underlying goals. Although dividing by $n$ (or even, strangely, by $n+1$) leads to a reduction in MSE, it is important to note that this reduction in MSE is negligible when the sample size is large. The sample variance has some nice properties including unbiasedness which leads to its popularity in practice.

knrumsey
  • 1
    The maximum likelihood estimator for the variance is biased $(j=0)$. The Least Squares estimator of the variance is indeed unbiased $(j=1)$. – Nadia Merquez Nov 16 '19 at 10:00
  • 1
    I second the objection about calling the normal variance MLE an unbiased estimator of variance. – Dave Nov 16 '19 at 10:30
  • Might be handy to note that for a large sample size $n\rightarrow \infty$, the small bias of $j=0$ instead of the correct $j=1$ becomes negligible. – Nadia Merquez Nov 16 '19 at 13:18
  • @NadiaMerquez point taken about the bias of the MLE. I misspoke. Answer has been corrected. – knrumsey Nov 16 '19 at 14:30
  • Thanks for your answer...I'm getting lost in the second step. How did you get that "shortcut" formula? – joshuaronis Nov 16 '19 at 17:11
  • Hmm....and another question, on steps 3, 4, and 5. I understand that, on the right, $E[\bar{X}^2]=Var(X)/n + (\mu_X)^2$. However, on the left, how come $E[(X_i)^2]$ is equal to $Var(X)+E[X]^2$? $X_i$ represents our random datapoint taken from our SAMPLE, not from the actual population. So, on the left shouldn't it be the variance of our ***sample*** squared plus the mean of our ***sample*** squared, and the left part and the right part wouldn't combine in the following steps? Thanks! – joshuaronis Nov 16 '19 at 21:59
  • I asked the same question on this youtube video: https://www.youtube.com/watch?v=D1hgiAla3KI&t=12s – joshuaronis Nov 16 '19 at 22:04
  • @JoshuaRonis You can find a derivation (and more information) by typing: "shortcut formula for sample variance" into a google search. As for your second question.. before we collect the data, a statistic (i.e. $\bar X$ or $S^2$) can be thought of as a random variable. A different sample from the population will lead to a different value for the statistic. When we examine properties of estimators, we take this into account. So yes, $E(X_i^2) = Var(X) + E(X)^2$ is the correct way of thinking about this. – knrumsey Nov 16 '19 at 23:09