I'm reading The Elements of Statistical Learning and came across the following:
Can anyone explain how they moved from 3.3 to 3.6, or even to 3.4?
Our loss function is $RSS(\beta) = (y - X\beta)^T(y -X\beta)$. Expanding this and using the fact that $(u - v)^T = u^T - v^T$, we have $$ RSS(\beta) = y^Ty - y^TX\beta - \beta^TX^Ty + \beta^T X^T X \beta. $$ Noting that $y^TX\beta$ is a scalar, and for any scalar $r \in \mathbb R$ we have $r = r^T$, it follows that $y^T X \beta = (y^T X \beta)^T = \beta^T X^T y$, so altogether $$ RSS(\beta) = y^T y - 2 \beta^T X^T y + \beta^T X^T X \beta. $$
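As a quick sanity check, here is a minimal NumPy sketch (with made-up random data, not anything from the book) confirming that the expanded form matches the original quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # made-up design matrix
y = rng.normal(size=50)        # made-up response
beta = rng.normal(size=3)      # arbitrary coefficient vector

# Original form: (y - X beta)^T (y - X beta)
rss_direct = (y - X @ beta) @ (y - X @ beta)

# Expanded form: y^T y - 2 beta^T X^T y + beta^T X^T X beta
rss_expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta

print(np.isclose(rss_direct, rss_expanded))  # True
```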
Now we'll differentiate with respect to $\beta$: $$ \frac{\partial RSS}{\partial \beta} = \frac{\partial}{\partial \beta} y^T y - 2 \frac{\partial}{\partial \beta} \beta^T X^T y + \frac{\partial}{\partial \beta} \beta^T X^T X \beta $$ $$ = 0 - 2 X^T y + 2 X^T X \beta. $$ If you haven't seen derivatives with respect to a vector before, the Matrix Cookbook is a popular reference.
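If the matrix calculus feels unfamiliar, a numerical check can also help build confidence. The sketch below (again with made-up data) compares the analytic gradient $-2X^Ty + 2X^TX\beta$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
beta = rng.normal(size=3)

def rss(b):
    """RSS(b) = (y - Xb)^T (y - Xb)."""
    r = y - X @ b
    return r @ r

# Analytic gradient from the derivation above
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_analytic, grad_numeric))  # True
```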
We want to find the minimum of $RSS$ so we'll set the derivative equal to $0$. This leads us to $$ \frac{\partial RSS}{\partial \beta} \stackrel{\text{set}}= 0 \implies -2X^T y + 2X^T X \beta = 0 $$ $$ \implies X^T y - X^T X \beta = 0 \implies X^T(y - X \beta) = 0. $$
Now we use the assumption that $X$ has full column rank, which guarantees that $X^T X$ is positive definite and therefore invertible. This gives $$ X^Ty = X^T X \hat\beta \implies \hat \beta = (X^T X)^{-1}X^T y, $$ which we obtain by left-multiplying both sides by $(X^T X)^{-1}$. Since the Hessian of $RSS$ is $2X^TX$, which is positive definite, this stationary point is indeed the minimum.
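Finally, a small sketch (made-up data once more) showing that solving the normal equations reproduces what a standard least-squares routine returns, and that the residual is orthogonal to the columns of $X$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))   # full column rank with probability 1
y = rng.normal(size=50)

# Closed form: beta_hat solves X^T X beta = X^T y
# (solving the linear system rather than forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Same answer from a standard least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))         # True

# The normal equations X^T (y - X beta_hat) = 0 hold at the solution
print(np.allclose(X.T @ (y - X @ beta_hat), 0))  # True
```

As a practical note, you would rarely compute $(X^TX)^{-1}$ explicitly; solving the linear system, or using a QR/SVD-based routine like `np.linalg.lstsq`, is numerically more stable.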