
In ordinary least squares there is this equation (Kevin Murphy's book, page 221, latest edition):

$$NLL(w)=\frac{1}{2}({y-Xw})^T(y-Xw)=\frac{1}{2}w^T(X^TX)w-w^T(X^T)y$$

I am not sure how the RHS equals the LHS. Maybe my linear algebra is weak, but I can't figure out how this happens. Can somebody point out how? This is related to deriving the ordinary least squares estimator $\hat{w}_{OLS}=(X^TX)^{-1}X^Ty$.

NLL stands for negative log-likelihood.

I am attaching screenshots of the relevant section. My equation is in the first image (page 221). I actually bought the book, so I am hoping that showing two pages is not a copyright infringement. Source: Kevin Murphy, *Machine Learning: A Probabilistic Perspective*.

[screenshots of the relevant pages from the book]

mathopt
  • Are you missing a $y^Ty$ term or does it somehow get dropped? –  Dec 11 '15 at 05:08
  • Could you tell us what NLL stands for? –  Dec 11 '15 at 05:08
  • 1
    Can you give the full reference? Note that $\frac12({y-Xw})^T(y-Xw)\neq \frac12w^T(X^TX)w-w^T(X^T)y$ but the different between them is not a function of $w$ -- so their argmin is in the same place. – Glen_b Dec 11 '15 at 07:28
  • 1
    It does not. There is an extra term $(1/2)y^Ty$ but this term is omitted since it does not depend explicitly on $w$ and it's value has no effect on the value of $w$ that minimizes $NLL(w)$. –  Dec 11 '15 at 08:13

1 Answer


The sum of squares expands to:

$$ \begin{align*} \frac{1}{2}({y-Xw})^T(y-Xw) &= \frac{1}{2}\left( y^\top y - y^\top Xw - \left( X w \right) ^\top y + \left( X w \right)^\top X w \right) \\ &= \frac{1}{2}\left( y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \right) \\ &= \frac{1}{2}\left(y^{\top}y - 2 y^\top X w + w^\top X^\top X w \right) \end{align*} $$

Note that $y^\top X w$ is a $1 \times 1$ matrix (i.e., a scalar), so it equals its own transpose and hence $y^\top X w = w^\top X^\top y$.
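
If it helps, here is a quick numerical sanity check of this expansion (a minimal sketch with arbitrary made-up data; the sizes and seed mean nothing in particular):

```python
import numpy as np

# Arbitrary toy data, just to check the algebra numerically.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)

lhs = 0.5 * (y - X @ w) @ (y - X @ w)                  # (1/2) ||y - Xw||^2
rhs = 0.5 * y @ y - y @ X @ w + 0.5 * w @ X.T @ X @ w  # expanded form, y^T y term included

print(np.isclose(lhs, rhs))   # True
print(rhs - 0.5 * y @ y)      # dropping (1/2) y^T y shifts the value but not the argmin
```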

Observe that the minimization problem: $$ \text{minimize (over $w$)} \quad \frac{1}{2}y^{\top}y - y^\top X w + \frac{1}{2} w^\top X^\top X w $$ has the same solution for $w$ as: $$ \text{minimize (over $w$)} \quad \frac{1}{2}w^\top X^\top X w - y^\top X w $$ This is because the constant $\frac{1}{2}y^\top y$ does not affect the optimal choice of $w$. Also observe that the objective is a convex function of $w$, hence the first-order condition is both necessary and sufficient for an optimum. The first-order condition is: $$ \frac{\partial \mathcal{L}}{\partial w} = \frac{1}{2}\left(X^\top X + (X^\top X)^\top \right)w - X^\top y = 0 $$ Since $X^\top X$ is symmetric, this reduces to the normal equations: $$ X^\top X w = X^\top y$$

If $X^\top X $ is full rank (i.e. the columns of $X$ are linearly independent), $X^\top X$ can be inverted to obtain the unique solution:

$$ w = \left( X^\top X \right)^{-1} X^\top y$$
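
As a sanity check (an illustrative sketch with arbitrary simulated data, not anything from the book), the closed form agrees with NumPy's least-squares routine:

```python
import numpy as np

# Arbitrary simulated data with full-column-rank X.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

w_formula = np.linalg.inv(X.T @ X) @ X.T @ y     # (X^T X)^{-1} X^T y, mirrors the formula above
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # same solution without forming the inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # NumPy's least-squares solver

print(np.allclose(w_formula, w_solve), np.allclose(w_formula, w_lstsq))  # True True
```

In practice you would solve the normal equations (e.g. with `np.linalg.solve`) or call a least-squares routine directly rather than explicitly inverting $X^\top X$; the explicit inverse above is only there to mirror the formula.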

Note: For reference, matrix calculus rules for taking the partial derivative with respect to $w$ can be found on [Wikipedia](https://en.wikipedia.org/wiki/Matrix_calculus). Or, as rightskewed pointed out, there's the classic [matrix cookbook](http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf).
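
If the matrix-calculus step is the sticking point, a finite-difference check is an easy way to convince yourself that the gradient really is $X^\top X w - X^\top y$. This is just an illustrative sketch with arbitrary toy data:

```python
import numpy as np

def nll(w, X, y):
    """(1/2) ||y - Xw||^2, the objective minimized above."""
    r = y - X @ w
    return 0.5 * r @ r

# Arbitrary toy data for the check.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = rng.normal(size=30)
w = rng.normal(size=3)

grad_analytic = X.T @ X @ w - X.T @ y   # gradient used in the first-order condition

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([
    (nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```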

Matthew Gunn
  • Hmm, the step that goes from the minimization equation to the partial-derivative equation needs some heavy linear algebra manipulation (I thought it was a simple partial derivative but it isn't!)... Had to go through this video to figure out that step: https://www.youtube.com/watch?v=n5QZL1B8FaE&index=56&list=PLD0F06AA0D2E8FFBA – mathopt Dec 12 '15 at 00:17
  • 1
    @user1467929 [This wikipedia page on matrix calculus](https://en.wikipedia.org/wiki/Matrix_calculus) is somewhat long and convoluted but actually a great reference. For notation purposes, the two approaches are numerator layout or denominator layout. In numerator layout: $\frac{\partial Aw}{\partial w^\top} = A$ while in denominator layout $\frac{\partial Aw}{\partial w} = A^\top$. I remember being endlessly confused by different notation back when first learning. Sometimes it's easier to write everything out as disgusting sum and then find the compact matrix notation. – Matthew Gunn Dec 12 '15 at 00:25
  • 2
    I keep [this comprehensive cookbook](http://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf) handy – rightskewed Dec 12 '15 at 00:40
  • @rightskewed that's a classic! – Matthew Gunn Dec 12 '15 at 00:43