A geometric, intuitive view
You can view the observation vector $y_1, y_2, y_3, \dots, y_n$ as being partitioned into two parts. These two parts are components in two orthogonal subspaces.
One part is the vector of fitted values $\hat y_1, \hat y_2, \hat y_3, \dots, \hat y_n$. The fitted vector lies in the subspace spanned by the regressor vectors $x$.
The other part is the vector of residuals $\epsilon_1, \epsilon_2, \epsilon_3, \dots, \epsilon_n$. The residual vector lies in the orthogonal complement of that subspace.
In this view of two subspaces, you can see the distribution of $y_1, \dots, y_n$ as a spherically symmetric $n$-dimensional multivariate normal distribution. It splits into $d$ independent normal components in the subspace of the fit and $n-d$ independent normal components in the residual space. These two parts are independent of each other, and the squared length of the residual part, the RSS, is after division by $\sigma^2$ a chi-squared variable with $n-d$ degrees of freedom.
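As a quick check of this decomposition, here is a minimal simulation sketch (the design matrix, coefficients, and noise level are illustrative choices, not from the text above): it projects $y$ onto the column space of a matrix $X$, verifies that the fitted part and the residual part are orthogonal, and compares the simulated RSS$/\sigma^2$ values against the $\chi^2_{n-d}$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 2.0
X = rng.normal(size=(n, d))        # regressors spanning a d-dimensional subspace
beta = np.array([1.0, -2.0, 0.5])  # true mean X @ beta lies inside that subspace

def rss_once():
    y = X @ beta + rng.normal(scale=sigma, size=n)
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_fit = X @ beta_hat              # component in the model subspace
    resid = y - y_fit                 # component in the orthogonal complement
    assert abs(y_fit @ resid) < 1e-6  # the two parts are orthogonal
    return resid @ resid              # RSS = squared length of the residual part

samples = np.array([rss_once() for _ in range(5000)]) / sigma**2
# RSS / sigma^2 should match a chi-squared distribution with n - d degrees of freedom
print(stats.kstest(samples, stats.chi2(df=n - d).cdf))
```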

The above explains the partition into a model part and a residual part. You can extend this to multiple models when they are nested. For instance, in the above image
- the plane is the space of the larger model, a sum of two vectors, $y_{fit 2} = a x_1 + b x_2$,
- and the black line inside that plane is the smaller model $y_{fit 1} = a x_1$.
If the space of $y_{fit 1}$ is inside the space of $y_{fit 2}$, then you can find the fit $y_{fit 1}$ by first fitting $y_{fit 2}$ and then treating that fit as the observation when fitting $y_{fit 1}$.
The difference between $y_{fit 1}$ and $y_{fit 2}$ can be seen as an additional residual: it is the residual from fitting $y_{fit 1}$ when starting from $y_{fit 2}$, and this difference vector is multivariate normal distributed. The difference in RSS between the two models is the squared length of this vector, so after division by $\sigma^2$ it is a chi-squared variable with $d_2 - d_1$ degrees of freedom, where $d_1$ and $d_2$ are the dimensions of the two models.
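The two-stage fitting claim and the RSS identity are easy to verify numerically. The following sketch (with illustrative data; the names x1, x2 and fit are mine, not from the text) fits the small model both directly to $y$ and to the fitted values of the large model, and checks that the drop in RSS equals the squared length of $y_{fit 2} - y_{fit 1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.5 * x1 + rng.normal(size=n)  # true mean lies on the line spanned by x1

def fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

y_fit2 = fit(np.column_stack([x1, x2]), y)  # big model: a*x1 + b*x2
y_fit1 = fit(x1[:, None], y)                # small model: a*x1
y_fit1_via2 = fit(x1[:, None], y_fit2)      # small model fitted to y_fit2

print(np.allclose(y_fit1, y_fit1_via2))     # True: two-stage fit agrees

rss1 = np.sum((y - y_fit1) ** 2)
rss2 = np.sum((y - y_fit2) ** 2)
extra = np.sum((y_fit2 - y_fit1) ** 2)      # the "additional residual", squared
print(np.isclose(rss1 - rss2, extra))       # True: RSS difference = its length^2
```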
This view of a multivariate normal distribution that can be split up into separate, independent, lower-dimensional multivariate normal parts only holds when the true model, the true population mean, actually lies inside the space of the (smaller) model.
The statistic
$$F = \frac{(RSS_1 - RSS_2)/(d_2 - d_1)}{RSS_2/(n - d_2)}$$
is a ratio of two independent chi-squared variables, each divided by its degrees of freedom, and it is only F-distributed when the null hypothesis is true, because only then is the numerator a central chi-squared variable.
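Finally, a hedged simulation of this last point (sample size, seed, and variable names are arbitrary choices): when the data are generated under the null hypothesis, so the true mean lies on the line of the smaller model, the ratio built from the two independent chi-squared parts should follow the $F(d_2 - d_1,\, n - d_2)$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 40, 5000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X1, X2 = x1[:, None], np.column_stack([x1, x2])
d1, d2 = 1, 2

def rss(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

fs = []
for _ in range(reps):
    y = 1.5 * x1 + rng.normal(size=n)  # null is true: the x2 coefficient is 0
    rss1, rss2 = rss(X1, y), rss(X2, y)
    fs.append(((rss1 - rss2) / (d2 - d1)) / (rss2 / (n - d2)))

# the simulated statistics should match the F(d2 - d1, n - d2) distribution
print(stats.kstest(fs, stats.f(dfn=d2 - d1, dfd=n - d2).cdf))
```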