
Multiplying sample variance (i.e. variance from sample mean) by $\frac{n}{n-1}$ to obtain an unbiased estimate of the population variance (i.e. variance from population mean) is called Bessel's correction.
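A quick Monte Carlo sketch of this (sample size and variance chosen arbitrarily): averaging over many samples, the raw sample variance undershoots the population variance by the factor $\frac{n-1}{n}$, and Bessel's correction removes the bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
sigma2 = 4.0                     # true population variance
trials = 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
biased = samples.var(axis=1, ddof=0)      # divides by n
corrected = samples.var(axis=1, ddof=1)   # divides by n - 1 (Bessel)

print(biased.mean())     # close to sigma2 * (n - 1) / n = 3.2
print(corrected.mean())  # close to sigma2 = 4.0
```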

We can think of the "mean" as a one-parameter model of the population, and of the variance as the sum of squared errors relative to that fitted model (which, for normally distributed errors, is the negative log-likelihood up to scale and an additive constant). Bessel's correction then makes intuitive sense as compensation for overfitting the model to the data. After all, the model always fits a single data point perfectly, since it has a single degree of freedom (or, in the case of a vector quantity, as many degrees of freedom as the data point itself). So we can say nothing about this model's error until the number of data points exceeds 1.
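The "mean as a fitted model" view can be checked directly: among all constant models, the sample mean is exactly the least-squares fit. A minimal sketch (the data values are arbitrary):

```python
import numpy as np

data = np.array([1.0, 4.0, 2.0, 7.0])

def sse(m):
    """Sum of squared errors of the constant model y = m."""
    return float(((data - m) ** 2).sum())

# Scan a fine grid of candidate constants; the minimizer is the sample mean.
grid = np.linspace(0.0, 8.0, 8001)
best = grid[np.argmin([sse(m) for m in grid])]

print(best, data.mean())   # both equal 3.5 (up to the grid spacing)
```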

Generalizing this model to a linear (affine) function introduces an additional degree of freedom, allowing it to fit an additional data point (in addition to the first data point) perfectly, as a line can intersect two points. Does it then make sense to apply a correction of $\frac{n}{n-2}$ to account for the greater overfitting? After all, we can say nothing about this model's error until the number of data points exceeds 2.
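The two-point claim is easy to verify numerically: a degree-1 least-squares fit to exactly two points leaves zero residual, so, just as with the mean and a single point, the training error is uninformative. A small sketch (the two points are arbitrary):

```python
import numpy as np

# A line has 2 parameters, so least squares interpolates any 2 points exactly.
x = np.array([0.0, 1.0])
y = np.array([3.0, -2.0])

slope, intercept = np.polyfit(x, y, 1)    # degree-1 least squares
residuals = y - (slope * x + intercept)

print(residuals)   # numerically zero: the fit error tells us nothing
```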

In general, if there are $k$ model parameters and we are interested in the expected square error (from the model's prediction), does it make sense to apply a correction of $\frac{n}{n-k}$ to the sample's square error (from the model's prediction)?
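For ordinary least squares this intuition is in fact exact: with $k$ linear parameters and i.i.d. Gaussian noise, the residual sum of squares divided by $n-k$ is an unbiased estimate of the noise variance. A simulation sketch (the linear model and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 2                   # n points, k = 2 parameters (intercept + slope)
sigma2 = 1.0                   # true noise variance
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])       # design matrix

trials = 50_000
# One noisy realization of y = 2 + 3x per column.
Y = 2.0 + 3.0 * x[:, None] + rng.normal(0.0, np.sqrt(sigma2), (n, trials))
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)   # fit all trials at once
R = Y - X @ beta                               # residuals, shape (n, trials)
rss = (R ** 2).sum(axis=0)                     # residual sum of squares

print((rss / n).mean())        # close to sigma2 * (n - k) / n = 0.8
print((rss / (n - k)).mean())  # close to sigma2 = 1.0
```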

Would this by any chance be consistent with one of the "information criteria" (BIC, AIC, ...)?

EDIT: To clarify the last sentence:

Given a set of $N$ data points $\{(x_i\in \mathbb{R}^n,\,y_i\in\mathbb{R})\}$ and a set of models $f_j:\mathbb{R}^n\to\mathbb{R}$, let $S_{f_j}=\sum_i(y_i-f_j(x_i))^2$ be the quantity minimized by $f_j$, which has $M_j$ parameters (i.e. $f_j$ was optimized in an $M_j$-dimensional space).

For any predetermined model $f$, we could estimate the distribution of the error $(y-f(x))$ as a normal distribution $\mathrm{Normal}(0,S_{f}/N)$ and then evaluate it with the Bayesian information criterion, Akaike information criterion, etc. This way, we can select the best model $f_{j^*}$ from $\{f_j\}$. I'm wondering whether I could reasonably just select the model $f_{j^*}$ that minimizes $S_{f_j}/(N-M_j)$, and whether this selection would be consistent with any of BIC, AIC, etc.
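Under that Gaussian error model, the proposed criterion can be put side by side with AIC directly. The sketch below (polynomial models, seed, and constants are all hypothetical) scores three nested fits by $S/(N-M)$ and by the Gaussian AIC $N\ln(S/N)+2(M+1)$, where the $+1$ counts the fitted variance; note the two criteria penalize complexity differently, so their rankings need not coincide in general.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
x = np.linspace(-1.0, 1.0, N)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, N)   # data from a degree-1 model

scores = {}
for deg in (1, 2, 5):                 # three nested polynomial models
    M = deg + 1                       # number of fitted parameters
    coeffs = np.polyfit(x, y, deg)
    S = float(((y - np.polyval(coeffs, x)) ** 2).sum())
    proposed = S / (N - M)                     # the criterion in question
    aic = N * np.log(S / N) + 2 * (M + 1)      # Gaussian AIC (+1 for sigma^2)
    scores[deg] = (S, proposed, aic)

# Nested least squares: S can only decrease as the degree grows,
# while both criteria trade that decrease against the parameter count M.
for deg, (S, proposed, aic) in scores.items():
    print(deg, S, proposed, aic)
```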

Museful
    The essentials of this seem to be discussed in numerous posts on site, e.g. https://stats.stackexchange.com/questions/277009/why-are-the-degrees-of-freedom-for-multiple-regression-n-k-1-for-linear-reg - though there's probably better ones. – Glen_b Jan 27 '20 at 05:38
  • @Glen_b-ReinstateMonica Thank you for the link but there I see only counting of degrees of freedom. I don't see anything related to the essence of my question which is compensation for overfitting / model selection criteria. – Museful Jan 27 '20 at 06:10
  • But degrees of freedom is *precisely* the concept you are asking about! – whuber Jan 27 '20 at 15:47
  • Halmos, P. 1946. The Theory of Unbiased Estimation. _Annals of Mathematical Statistics_ 17(1), 34–43. www.jstor.org/stable/2235902 may help. – Nick Cox Jan 27 '20 at 16:06
  • @whuber I'm surely missing a lot of background knowledge, but in my mind I had asked a yes/no question, and reading those answers I fail to arrive at a yes or at a no. Is it because the answer is more complicated than a yes/no, or is it because it is obviously yes and in statistics this is implicit in the term "degrees of freedom"? – Museful Jan 28 '20 at 01:06
  • There are several problems with the final question as posed. First, yes/no questions aren't suitable for this site; second, it's vague: what exactly do you mean by "consistent with"? For this reason I, and I suspect all other readers, have ignored your last sentence and focused on the earlier question, "does it make sense to apply a correction..." That's extensively discussed in the duplicate threads. – whuber Jan 28 '20 at 13:54
  • @whuber "consistent with" == "it selects the same model as"=="it ranks models in the same order as" for some reasonable/typical relationship between likelihoods and residuals (e.g. normal distribution). – Museful Jan 28 '20 at 14:02
  • I get the thrust of your interest but I still don't know what you're referring to by "it." Evidently you have in mind some kind of procedure based on some kind of statistic that itself involves an estimate of a variance, but that's as far as I can take the inference. – whuber Jan 28 '20 at 14:04

0 Answers