I am going through the book The Elements of Statistical Learning and I'm finding it extremely terse. I have a background in probability but not statistics, so perhaps that is why. Anyway, in Chapter 3, Section 3.2 (p. 47), they discuss the unbiased estimate of the variance $\sigma^2$ of the $y_i$. They write,
> Typically one estimates the variance $\sigma^2$ by $$ \hat{\sigma}^2 = \frac{1}{N - p - 1} \sum_{i = 1}^N ( y_i - \hat{y}_i)^2.$$ The $N - p - 1$ rather than $N$ in the denominator makes $\hat{\sigma}^2$ an unbiased estimate of $\sigma^2$.
Here the $y_i$ are the observations, which we assume are distributed like $ x_i^T \beta + \zeta_i$, where the $x_i$ are the inputs, the $\zeta_i$ are independent errors with mean zero and variance $\sigma^2$, and $\beta$ is fixed. The $\hat{y}_i$ are the fitted values from the linear regression we have constructed, i.e., $\hat{y}_i = x_i^T \hat{\beta}$ with $\hat{\beta}$ the least-squares coefficient vector, $\hat{\beta} = (X^T X)^{-1}X^T y$.
They make this comment without any explanation. Can someone explain why this is true? And, even better, why it should be obvious?
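For what it's worth, here is a quick numerical sanity check (my own, not from the book) that the claim at least holds empirically; the sizes $N$, $p$ and the noise level $\sigma$ below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, sigma = 200, 5, 2.0                        # arbitrary sizes for illustration

# X is N x (p+1): a column of ones plus p random features
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta = rng.normal(size=p + 1)                    # fixed "true" coefficients
y = X @ beta + rng.normal(scale=sigma, size=N)   # y_i = x_i^T beta + zeta_i

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # (X^T X)^{-1} X^T y
y_hat = X @ beta_hat                             # fitted values
sigma2_hat = np.sum((y - y_hat) ** 2) / (N - p - 1)
print(sigma2_hat)                                # close to sigma**2 = 4.0
```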
Edit: Thanks to @gung for the link, but I found that calculation still unclear and missing some details. I'm writing down my own calculation here for completeness.
Some notation: Let $X$ be the $N \times (p+1)$ matrix of samples by features (the extra column is the column of all ones, for the intercept). Let $y = (y_1,\dots,y_N)^T$ and $\hat{y} = (\hat{y}_1,\dots,\hat{y}_N)^T$.
Under these conditions, $\hat{y} = X \hat{\beta}$ and $y = X \beta + \zeta$, where $\zeta$ is an $N\times 1$ vector of independent Gaussians with mean zero and common variance $\sigma^2$. Using the formula for $\hat{\beta}$ above, this gives $$ \hat{\beta} = (X^TX)^{-1}X^T y = \beta + (X^TX)^{-1}X^T \zeta. $$ And so, $$ \begin{align} y - \hat{y} & = X\beta + \zeta - (X\beta + X(X^TX)^{-1}X^T \zeta) \\ & = (I_N - X(X^TX)^{-1}X^T) \zeta \\ & := (I_N - B) \zeta, \end{align} $$ with $B := X(X^TX)^{-1}X^T$. Therefore, $$ \begin{align} \sum_{i=1}^N \mathbb{E} (y_i - \hat{y}_i)^2 & = \mathbb{E} \left[(y-\hat{y})^T(y-\hat{y}) \right] \\ & =\mathbb{E} \left[ \zeta^T (I_N-B)^T (I_N-B) \zeta \right] \\ & = \sigma^2 Tr\left((I_N-B)^T(I_N- B)\right) . \end{align} $$ The last equality follows because the $\zeta_i$ are independent with mean zero and common variance $\sigma^2$, so $\mathbb{E} \left[\zeta^T A \zeta\right] = \sum_{i,j} A_{ij}\, \mathbb{E}[\zeta_i \zeta_j] = \sigma^2 Tr(A)$ for any fixed matrix $A$. Next, since $B$ is symmetric ($B^T = B$) and idempotent ($B^2 = B$), $$ (I_N-B)^T(I_N-B) = I_N - B = I_N - X(X^TX)^{-1}X^T. $$ Therefore, using the cyclic property of the trace, $Tr(X(X^TX)^{-1}X^T) = Tr((X^TX)^{-1}X^TX) = Tr(I_{p+1})$, and $$ \begin{align} Tr\left((I_N-B)^T(I_N-B)\right) & = Tr(I_N) - Tr(X(X^TX)^{-1}X^T) \\ & = Tr(I_N) - Tr(I_{p+1}) \\ & = N-p-1. \end{align} $$ Dividing the expected residual sum of squares by $N-p-1$ therefore gives $\mathbb{E}[\hat{\sigma}^2] = \sigma^2$, i.e. the estimator is unbiased.
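As a final sanity check (again my own, not from the book), one can verify numerically both the trace identity $Tr(I_N - B) = N - p - 1$ and the unbiasedness by averaging $\hat{\sigma}^2$ over many simulated noise vectors for a fixed design matrix $X$; all sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 50, 3, 1.5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])

# B = X (X^T X)^{-1} X^T, the "hat" matrix
B = X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(np.eye(N) - B))               # prints ~ N - p - 1 = 46

beta = rng.normal(size=p + 1)
reps = 20_000
draws = np.empty(reps)
for r in range(reps):
    zeta = rng.normal(scale=sigma, size=N)   # independent errors, variance sigma^2
    y = X @ beta + zeta
    resid = y - B @ y                        # y - y_hat = (I_N - B) y
    draws[r] = resid @ resid / (N - p - 1)   # one realization of sigma_hat^2
print(draws.mean())                          # averages to ~ sigma**2 = 2.25
```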