In the context of a linear regression, say
\begin{align} y_{i} & = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + \ldots + \beta_{k}x_{ki} + \epsilon_{i} \end{align}
the F-statistic for the overall significance test (that all slope coefficients are zero) is
\begin{align} F & = \frac{\sum_{i=1}^{n}(\hat{y}_{i} - \bar{y})^{2}/k}{\sum_{i=1}^{n}(y_{i} - \hat{y}_{i})^{2}/(n - k - 1)} = \frac{ESS/k}{RSS/(n - k - 1)} \end{align}
where $ESS$ is the 'explained sum of squares' and $RSS$ is the 'residual sum of squares'. The degrees of freedom are $df_1 = k$ in the numerator and $df_2 = n - k - 1$ in the denominator, where $k$ is the number of predictors.
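To keep myself honest about what goes into the formula, here is a minimal sketch of how I compute these quantities by hand (toy data and variable names of my own; assumes numpy and scipy are available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))
beta = np.array([1.0, 0.5, -0.3, 0.0])        # intercept plus k slopes (my own toy values)
y = beta[0] + X @ beta[1:] + rng.normal(size=n)

# OLS fit via least squares on the design matrix [1, X]
Xd = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)
y_hat = Xd @ beta_hat

ess = np.sum((y_hat - y.mean()) ** 2)          # explained sum of squares
rss = np.sum((y - y_hat) ** 2)                 # residual sum of squares
F = (ess / k) / (rss / (n - k - 1))
p = stats.f.sf(F, k, n - k - 1)                # p-value from F(k, n - k - 1)
print(F, p)
```

The p-value from `stats.f.sf` matches the overall F-test that standard regression output reports, so at least mechanically I can follow the recipe; my questions are about why it works.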
My questions are:
- My understanding is that the F-distribution arises as the ratio of two independent chi-square distributed variables, each divided by its degrees of freedom. How are $ESS$ and $RSS$ chi-square distributed? I thought a chi-square variable results from a sum of squares of independent standard normal variables, each $\sim N(0, 1)$. I see the sum-of-squares part, but why, for example, would $(\hat{y}_{i} - \bar{y})$ be standard normal? (A small simulation sketched after this list suggests the claimed distributions do hold numerically, but I'd like to understand why.)
- Where do the degrees of freedom come from? It seems arbitrary to me that we divide by $k$ in the numerator and by $n - k - 1$ in the denominator. Is there some intuition for why we typically divide the numerator by a relatively small number and the denominator by a much larger one (assuming $n \gg k$ in most regression models)? Is it because we don't need to know each $y_{i}$ to form the $ESS$, just the $k$ coefficients that determine the fitted regression line $\hat{y}_{i}$? In that case far fewer pieces of information would go into the $ESS$ than into the $RSS$, but I'm not sure whether I'm even on the right track with that line of reasoning.
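For reference, here is the simulation mentioned in the first bullet. It generates data under $H_0$ (all slopes zero, normal errors; toy setup and names of my own) and checks numerically that $ESS/\sigma^2$ and $RSS/\sigma^2$ behave like $\chi^2_{k}$ and $\chi^2_{n-k-1}$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, k, sigma, reps = 50, 3, 2.0, 5000
ess_vals, rss_vals = [], []
for _ in range(reps):
    X = rng.normal(size=(n, k))
    y = 1.0 + rng.normal(scale=sigma, size=n)   # true slopes are all zero (H0 holds)
    Xd = np.column_stack([np.ones(n), X])
    y_hat = Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]
    ess_vals.append(np.sum((y_hat - y.mean()) ** 2))
    rss_vals.append(np.sum((y - y_hat) ** 2))

# The mean of a chi-square(df) variable is df, so these ratios should be near 1:
print(np.mean(ess_vals) / sigma**2 / k)             # ~ 1
print(np.mean(rss_vals) / sigma**2 / (n - k - 1))   # ~ 1
# KS tests against the claimed chi-square laws; p-values should not be
# systematically small if the distributions really match:
print(stats.kstest(np.array(ess_vals) / sigma**2, stats.chi2(k).cdf).pvalue)
print(stats.kstest(np.array(rss_vals) / sigma**2, stats.chi2(n - k - 1).cdf).pvalue)
```

The empirical means and the KS tests come out consistent with the chi-square claims, which is exactly what leads me to ask *why* they hold.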
Some posts already touch on this (e.g., Formation of the test statistic in one-way ANOVA, Why use the F distribution and F test?, F-test and F-distribution), but I haven't seen these questions answered yet in a way that I'm able to understand.