
When estimating a parameter such as (to take one instance, though my question is not about this specific case) the variance of a random variable $X$, one usually adopts Bessel's correction, i.e. using the formula $\widehat{\operatorname{Var}}(X) = \frac{1}{n-1}\sum_{i=1}^n (x_i -\bar{x})^2$.
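
For concreteness, a minimal sketch of the two denominators (assuming Python with NumPy, where the `ddof` argument selects the correction; the sample values are arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # arbitrary sample

var_uncorrected = np.var(x, ddof=0)  # divides by n
var_corrected   = np.var(x, ddof=1)  # divides by n - 1 (Bessel's correction)

# Explicit form of the corrected estimator
xbar = x.mean()
var_manual = ((x - xbar) ** 2).sum() / (len(x) - 1)
assert np.isclose(var_manual, var_corrected)
```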

The justification given on Wikipedia and in all the other sources I've found is along one of the following lines:

  • the $n-1$ factor arises from dividing by the degrees of freedom of the residual terms
  • the $n-1$ factor ensures unbiasedness
  • the $n-1$ factor corrects for the underestimation of the variance that would otherwise occur

However, why does it make sense to divide by the degrees of freedom?

In general, it seems pretty common to divide parameter estimates not by $n$, the number of sample points used to calculate them, but by $df$. Why does this generally make sense?

EDIT: to clarify my question, what I'm asking is whether, in a general setting, dividing an uncorrected estimate by its degrees of freedom will produce an unbiased estimator, or an estimator with other desirable properties. This procedure seems common, but I have not seen a general proof (and don't know if one exists) of why it works in general.

In particular, I suspect the reason is probably expressed in terms of dimensions of subspaces, or connects back to the degrees of freedom of distributions (which seems closely related).

For individual estimates like the sample variance or the MLR residual variance estimate $\frac{RSS}{n-k-1}$, I am aware that proofs of unbiasedness exist, but they are specific to the problem at hand.

mdc
  • https://stats.stackexchange.com/questions/3931/intuitive-explanation-for-dividing-by-n-1-when-calculating-standard-deviation/169207#169207 . Related, but doesn't address covariance. – Mark L. Stone Apr 22 '18 at 01:59
  • Yes, you are mistaken about the covariance -- the df is $n-1$ for the covariance as well as for the variances. There are several ways to see this. I don't have time to write them out but hopefully someone else will do so for you. The ultimate justification for df is rather deeper than any of the usual justifications that you mention so far -- the df is the dimension of the multivariate space that the residual vector belongs to, and all the other justifications flow from that basic fact. – Gordon Smyth Apr 22 '18 at 06:37
  • The covariance question at the end is asked and answered at https://stats.stackexchange.com/questions/142456. It's not clear what you mean by "make sense" or even by "df": what context do you have in mind and how is your "df" computed? The issues about dividing by $n-1$ are addressed at https://stats.stackexchange.com/questions/3931. – whuber Apr 22 '18 at 14:33
  • I guess what I was looking for specifically was a formal reason why dividing an estimate by its degrees of freedom would give a desirable estimate _in the general context_ (i.e. a linear algebra explanation of what the dimension of the subspace in which the residuals $x_i - \bar{x}$ live has to do with unbiasedness). Another example is in MLR with $\frac{RSS}{n-k-1}$. I understand that most of these cases can be explained individually, but is there any deeper understanding that connects them? Also, does dividing an estimate by its df guarantee unbiasedness in the general case, and why? – mdc Apr 22 '18 at 18:00
  • @whuber with df I just mean "degrees of freedom" as defined [here](https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)). In particular I'm interested in df in the context of parameter estimation under the definition: df = "the number of independent observations in a sample of data that are available to estimate a parameter of the population from which that sample is drawn". – mdc Apr 22 '18 at 18:14
  • We don't divide parameter estimates by the df "in general". Eg, the mean is not calculated that way. It's true for the variance, but the reason is specific to the variance, it isn't a universal principle. – gung - Reinstate Monica Apr 23 '18 at 00:50
  • @gung The mean could be considered calculated that way, as there are $n$ degrees of freedom in calculating the mean, and we divide by $n$. But yeah, I guess it might just be a one-off approach. In the end it does seem that the "divide by df" approach is mainly used in variance-related estimations (estimating $\sigma_{\epsilon}$ in MLR, estimating variance and covariance). – mdc Apr 23 '18 at 07:21

1 Answer


Bessel's correction is adopted to correct for bias in using the sample variance as an estimator of the true variance. The bias in the uncorrected statistic occurs because the sample mean is closer to the middle of the observations than the true mean is, and so the squared deviations around the sample mean systematically underestimate the squared deviations around the true mean.

To see this phenomenon algebraically, just derive the expected value of a sample variance without Bessel's correction and see what it looks like. Letting $S_*^2$ denote the uncorrected sample variance (using $n$ as the denominator) we have:

$$\begin{equation} \begin{aligned} S_*^2 &= \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2 \\[8pt] &= \frac{1}{n} \sum_{i=1}^n (X_i^2 - 2 \bar{X} X_i + \bar{X}^2) \\[8pt] &= \frac{1}{n} \Bigg( \sum_{i=1}^n X_i^2 - 2 \bar{X} \sum_{i=1}^n X_i + n \bar{X}^2 \Bigg) \\[8pt] &= \frac{1}{n} \Bigg( \sum_{i=1}^n X_i^2 - 2 n \bar{X}^2 + n \bar{X}^2 \Bigg) \\[8pt] &= \frac{1}{n} \Bigg( \sum_{i=1}^n X_i^2 - n \bar{X}^2 \Bigg) \\[8pt] &= \frac{1}{n} \sum_{i=1}^n X_i^2 - \bar{X}^2. \end{aligned} \end{equation}$$

Taking expectations yields:

$$\begin{equation} \begin{aligned} \mathbb{E}(S_*^2) &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}(X_i^2) - \mathbb{E} (\bar{X}^2) \\[8pt] &= \frac{1}{n} \sum_{i=1}^n (\mu^2 + \sigma^2) - (\mu^2 + \frac{\sigma^2}{n}) \\[8pt] &= (\mu^2 + \sigma^2) - (\mu^2 + \frac{\sigma^2}{n}) \\[8pt] &= \sigma^2 - \frac{\sigma^2}{n} \\[8pt] &= \frac{n-1}{n} \cdot \sigma^2 \\[8pt] \end{aligned} \end{equation}$$
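
A quick Monte Carlo check of this expectation (a sketch, assuming Python with NumPy; the sample size, mean, and variance are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, mu, sigma2, reps = 5, 10.0, 4.0, 200_000

# Many independent samples of size n
samples = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(reps, n))

s2_uncorrected = samples.var(axis=1, ddof=0)  # divide by n
s2_corrected   = samples.var(axis=1, ddof=1)  # divide by n - 1

print(s2_uncorrected.mean())  # close to (n-1)/n * sigma2 = 3.2
print(s2_corrected.mean())    # close to sigma2 = 4.0
```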

So you can see that the uncorrected sample variance statistic underestimates the true variance $\sigma^2$. Bessel's correction replaces the denominator with $n-1$, which yields an unbiased estimator. In regression analysis this extends to the more general case where the estimated mean is a linear function of multiple predictors; in that case the denominator is reduced further, to the correspondingly smaller number of residual degrees of freedom.
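
As a sketch of the regression extension (assuming Python with NumPy; the design matrix, coefficients, and noise level are arbitrary illustrative choices), one can check by simulation that $RSS/(n-k-1)$ averages to $\sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, sigma2, reps = 30, 3, 2.0, 20_000

# Fixed design: intercept plus k predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0, 0.5])

estimates = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta_hat) ** 2).sum()
    estimates[r] = rss / (n - k - 1)  # residual degrees of freedom

print(estimates.mean())  # close to sigma2 = 2.0
```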

Ben
    Because this post nicely answers https://stats.stackexchange.com/questions/3931, perhaps it belongs in that thread rather than this one. – whuber Apr 22 '18 at 14:35
  • I've copied it across - thanks for the suggestion. – Ben Apr 23 '18 at 00:04