In the context of a linear regression, say
\begin{align} y_{i} & = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \ldots + \beta_{k}x_{ki} + \epsilon_{i} \end{align}
the F-statistic for the overall significance test (that all slope coefficients are zero) is
\begin{align} F & = \frac{\sum_{i=1}^{n}(\hat{y}_{i} - \bar{y})^{2}/k}{\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}/(n - k - 1)} = \frac{ESS/k}{RSS/(n - k - 1)} \end{align}
where $ESS$ is the 'explained sum of squares' and $RSS$ is the 'residual sum of squares'. The degrees of freedom are $df_1 = k$ in the numerator and $df_2 = n - k - 1$ in the denominator, where $k$ is the number of predictors.
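To keep myself honest about what goes into the formula, here is a minimal sketch of how I compute these quantities by hand (toy data and variable names of my own; assumes numpy and scipy are available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, 0.5, -0.3, 0.0])        # intercept plus k slopes (my own toy values)
y = beta[0] + X @ beta[1:] + rng.normal(size=n)

# OLS fit via least squares on the design matrix [1, X]
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta_hat

ess = np.sum((y_hat - y.mean()) ** 2)          # explained sum of squares
rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
F = (ess / k) / (rss / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)                # p-value from F(k, n - k - 1)
print(F, p)
```

The p-value from `stats.f.sf` matches the overall F-test that standard regression output reports, so at least mechanically I can follow the recipe; my questions are about why it works.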
My questions are:
- My understanding is that the F-distribution arises as the ratio of two independent chi-square distributed variables, each divided by its degrees of freedom. How are $ESS$ and $RSS$ chi-square distributed? I thought a chi-square variable results from a sum of squares of independent standard normal variables, each $\sim N(0, 1)$. I see the sum-of-squares part, but why, for example, would $(\hat{y}_{i} - \bar{y})$ be standard normal? (A small simulation sketched after this list suggests the claimed distributions do hold numerically, but I'd like to understand why.)
- Where do the degrees of freedom come from? It seems arbitrary to me that we divide by $k$ in the numerator and by $n - k - 1$ in the denominator. Is there some intuition for why we typically divide the numerator by a relatively small number and the denominator by a much larger one (assuming $n \gg k$ in most regression models)? Is it because we don't need to know each $y_{i}$ to form the $ESS$, just the $k$ coefficients that determine the fitted regression line $\hat{y}_{i}$? In that case far fewer pieces of information would go into the $ESS$ than into the $RSS$, but I'm not sure whether I'm even on the right track with that line of reasoning.
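For reference, here is the simulation mentioned in the first bullet. It generates data under $H_0$ (all slopes zero, normal errors; toy setup and names of my own) and checks numerically that $ESS/\sigma^2$ and $RSS/\sigma^2$ behave like $\chi^2_{k}$ and $\chi^2_{n-k-1}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, sigma, reps = 50, 3, 2.0, 5000
ess_vals, rss_vals = [], []
for _ in range(reps):
    X = rng.normal(size=(n, k))
    y = 1.0 + rng.normal(scale=sigma, size=n)   # true slopes are all zero (H0 holds)
    Xd = np.column_stack([np.ones(n), X])
    y_hat = Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]
    ess_vals.append(np.sum((y_hat - y.mean()) ** 2))
    rss_vals.append(np.sum((y - y_hat) ** 2))

# The mean of a chi-square(df) variable is df, so these ratios should be near 1:
print(np.mean(ess_vals) / sigma**2 / k)             # ~ 1
print(np.mean(rss_vals) / sigma**2 / (n - k - 1))   # ~ 1
# KS tests against the claimed chi-square laws; p-values should not be
# systematically small if the distributions really match:
print(stats.kstest(np.array(ess_vals) / sigma**2, stats.chi2(k).cdf).pvalue)
print(stats.kstest(np.array(rss_vals) / sigma**2, stats.chi2(n - k - 1).cdf).pvalue)
```

The empirical means and the KS tests come out consistent with the chi-square claims, which is exactly what leads me to ask *why* they hold.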
Some posts already touch on this (e.g., Formation of the test statistic in one-way ANOVA, Why use the F distribution and F test?, F-test and F-distribution), but I haven't seen these questions answered yet in a way that I'm able to understand.