I'm reading The Elements of Statistical Learning and came across the following:
Can anyone explain how they moved from 3.3 to 3.6, or even to 3.4?
Our loss function is $RSS(\beta) = (y - X\beta)^T(y -X\beta)$. Expanding this and using the fact that $(u - v)^T = u^T - v^T$, we have $$ RSS(\beta) = y^Ty - y^TX\beta - \beta^TX^Ty + \beta^T X^T X \beta. $$ Noting that $y^TX\beta$ is a scalar, and for any scalar $r \in \mathbb R$ we have $r = r^T$, it follows that $y^T X \beta = (y^T X \beta)^T = \beta^T X^T y$, so altogether $$ RSS(\beta) = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta. $$
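As a quick sanity check, here is a minimal NumPy sketch (with made-up random data, not anything from the book) confirming that the expanded form matches the original quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # made-up design matrix
y = rng.normal(size=50)        # made-up response
beta = rng.normal(size=3)      # arbitrary coefficient vector

# Original form: (y - X beta)^T (y - X beta)
rss_direct = (y - X @ beta) @ (y - X @ beta)

# Expanded form: y^T y - 2 beta^T X^T y + beta^T X^T X beta
rss_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(np.isclose(rss_direct, rss_expanded))  # True
```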
Now we'll differentiate with respect to $\beta$: $$ \frac{\partial RSS}{\partial \beta} = \frac{\partial}{\partial \beta} y^T y - 2 \frac{\partial}{\partial \beta} \beta^T X^T y + \frac{\partial}{\partial \beta} \beta^T X^T X \beta $$ $$ = 0 - 2 X^T y + 2 X^T X \beta. $$ If you haven't seen derivatives with respect to a vector before, the Matrix Cookbook is a popular reference.
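If the matrix calculus feels unfamiliar, a numerical check can also help build confidence. The sketch below (again with made-up data) compares the analytic gradient $-2X^Ty + 2X^TX\beta$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
beta = rng.normal(size=3)

def rss(b):
    """RSS(b) = (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r

# Analytic gradient from the derivation above
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_analytic, grad_numeric))  # True
```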
We want to find the minimum of $RSS$ so we'll set the derivative equal to $0$. This leads us to $$ \frac{\partial RSS}{\partial \beta} \stackrel{\text{set}}= 0 \implies -2X^T y + 2X^T X \beta = 0 $$ $$ \implies X^T y - X^T X \beta = 0 \implies X^T(y - X \beta) = 0. $$
Now we use the assumption that $X$ has full column rank, which guarantees that $X^T X$ is positive definite and therefore invertible. This gives $$ X^Ty = X^T X \hat\beta \implies \hat \beta = (X^T X)^{-1}X^T y, $$ which we obtain by left-multiplying both sides by $(X^T X)^{-1}$. Since the Hessian of $RSS$ is $2X^TX$, which is positive definite, this stationary point is indeed the minimum.
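Finally, a small sketch (made-up data once more) showing that solving the normal equations reproduces what a standard least-squares routine returns, and that the residual is orthogonal to the columns of $X$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))   # full column rank with probability 1
y = rng.normal(size=50)

# Closed form: beta_hat solves X^T X beta = X^T y
# (solving the linear system rather than forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from a standard least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))         # True

# The normal equations X^T (y - X beta_hat) = 0 hold at the solution
print(np.allclose(X.T @ (y - X @ beta_hat), 0))  # True
```

As a practical note, you would rarely compute $(X^TX)^{-1}$ explicitly; solving the linear system, or using a QR/SVD-based routine like `np.linalg.lstsq`, is numerically more stable.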