
In ordinary least squares there is this equation (Kevin Murphy's book, page 221, latest edition):

$$NLL(w)=\frac{1}{2}({y-Xw})^T(y-Xw)=\frac{1}{2}w^T(X^TX)w-w^T(X^T)y$$

I am not sure how the RHS equals the LHS. Maybe my linear algebra is weak, but I can't figure out how this happens. Can somebody point out how? This is related to deriving the ordinary least squares estimator $\hat{w}_{OLS}=(X^TX)^{-1}X^Ty$.

NLL stands for negative log-likelihood.

I am attaching screenshots of the relevant section. My equation is in the first image (page 221). I actually bought the book, so I am hoping that showing two pages is not a copyright infringement. Source: Kevin Murphy, *Machine Learning: A Probabilistic Perspective*.

[screenshots of the relevant pages from the book]

mathopt
  • Are you missing a $y^Ty$ term or does it somehow get dropped? –  Dec 11 '15 at 05:08
  • Could you tell us what NLL stands for? –  Dec 11 '15 at 05:08
  • 1
    Can you give the full reference? Note that $\frac12({y-Xw})^T(y-Xw)\neq \frac12w^T(X^TX)w-w^T(X^T)y$ but the different between them is not a function of $w$ -- so their argmin is in the same place. – Glen_b Dec 11 '15 at 07:28
  • 1
    It does not. There is an extra term $(1/2)y^Ty$ but this term is omitted since it does not depend explicitly on $w$ and it's value has no effect on the value of $w$ that minimizes $NLL(w)$. –  Dec 11 '15 at 08:13

1 Answer


The sum of squares expands to:

$$ \begin{align*} \frac{1}{2}({y-Xw})^T(y-Xw) &= \frac{1}{2}\left( y^\top y - y^\top Xw - \left( X w \right) ^\top y + \left( X w \right)^\top X w \right) \\ &= \frac{1}{2}\left( y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \right) \\ &= \frac{1}{2}\left(y^{\top}y - 2 y^\top X w + w^\top X^\top X w \right) \end{align*} $$

Note that $y^\top X w$ is a $1 \times 1$ matrix (i.e., a scalar), so it equals its own transpose and hence $y^\top X w = w^\top X^\top y$.
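
If it helps, here is a quick numerical sanity check of this expansion (a minimal sketch with arbitrary made-up data; the sizes and seed mean nothing in particular):

```python
import numpy as np

# Arbitrary toy data, just to check the algebra numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)

lhs = 0.5 * (y - X @ w) @ (y - X @ w)                  # (1/2) ||y - Xw||^2
rhs = 0.5 * y @ y - y @ X @ w + 0.5 * w @ X.T @ X @ w  # expanded form, y^T y term included

print(np.isclose(lhs, rhs))   # True
print(rhs - 0.5 * y @ y)      # dropping (1/2) y^T y shifts the value but not the argmin
```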

Observe that the minimization problem: $$ \text{minimize (over $w$)} \quad \frac{1}{2}y^{\top}y - y^\top X w + \frac{1}{2} w^\top X^\top X w $$ has the same solution for $w$ as: $$ \text{minimize (over $w$)} \quad \frac{1}{2}w^\top X^\top X w - y^\top X w $$ This is because the constant $\frac{1}{2}y^\top y$ does not affect the optimal choice of $w$. Also observe that the objective is a convex function of $w$, hence the first-order condition is both necessary and sufficient for an optimum. The first-order condition is: $$ \frac{\partial \mathcal{L}}{\partial w} = \frac{1}{2}\left(X^\top X + (X^\top X)^\top \right)w - X^\top y = 0 $$ Since $X^\top X$ is symmetric, this reduces to the normal equations: $$ X^\top X w = X^\top y$$

If $X^\top X $ is full rank (i.e. the columns of $X$ are linearly independent), $X^\top X$ can be inverted to obtain the unique solution:

$$ w = \left( X^\top X \right)^{-1} X^\top y$$
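
As a sanity check (an illustrative sketch with arbitrary simulated data, not anything from the book), the closed form agrees with NumPy's least-squares routine:

```python
import numpy as np

# Arbitrary simulated data with full-column-rank X.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

w_formula = np.linalg.inv(X.T @ X) @ X.T @ y     # (X^T X)^{-1} X^T y, mirrors the formula above
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # same solution without forming the inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # NumPy's least-squares solver

print(np.allclose(w_formula, w_solve), np.allclose(w_formula, w_lstsq))  # True True
```

In practice you would solve the normal equations (e.g. with `np.linalg.solve`) or call a least-squares routine directly rather than explicitly inverting $X^\top X$; the explicit inverse above is only there to mirror the formula.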

Note: For reference, matrix calculus rules for taking the partial derivative with respect to $w$ can be found on [Wikipedia](https://en.wikipedia.org/wiki/Matrix_calculus). Or, as rightskewed pointed out, there's the classic [matrix cookbook](http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf).
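
If the matrix-calculus step is the sticking point, a finite-difference check is an easy way to convince yourself that the gradient really is $X^\top X w - X^\top y$. This is just an illustrative sketch with arbitrary toy data:

```python
import numpy as np

def nll(w, X, y):
    """(1/2) ||y - Xw||^2, the objective minimized above."""
    r = y - X @ w
    return 0.5 * r @ r

# Arbitrary toy data for the check.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
w = rng.normal(size=3)

grad_analytic = X.T @ X @ w - X.T @ y   # gradient used in the first-order condition

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([
    (nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```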

Matthew Gunn
  • Hmm, the step that goes from the minimization equation to the partial-derivative equation needs some heavy linear algebra manipulation (I thought it was a simple partial derivative but it isn't!)... Had to go through this video to figure out that step: https://www.youtube.com/watch?v=n5QZL1B8FaE&index=56&list=PLD0F06AA0D2E8FFBA – mathopt Dec 12 '15 at 00:17
  • 1
    @user1467929 [This wikipedia page on matrix calculus](https://en.wikipedia.org/wiki/Matrix_calculus) is somewhat long and convoluted but actually a great reference. For notation purposes, the two approaches are numerator layout or denominator layout. In numerator layout: $\frac{\partial Aw}{\partial w^\top} = A$ while in denominator layout $\frac{\partial Aw}{\partial w} = A^\top$. I remember being endlessly confused by different notation back when first learning. Sometimes it's easier to write everything out as disgusting sum and then find the compact matrix notation. – Matthew Gunn Dec 12 '15 at 00:25
  • 2
    I keep [this comprehensive cookbook](http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf) handy – rightskewed Dec 12 '15 at 00:40
  • @rightskewed that's a classic! – Matthew Gunn Dec 12 '15 at 00:43