
Could someone please explain why:

\begin{equation} \frac{\partial (Y-\beta^T X)^T (Y-\beta^T X)}{\partial \beta}=2X^T(Y-\beta^T X) \end{equation}

and why:

\begin{equation} \frac{\partial \lambda \beta^T \beta}{\partial \beta}=2\lambda\beta \end{equation}

As for the latter equation, I'm just not sure why we are left with \begin{equation} 2\lambda\beta \end{equation} as opposed to \begin{equation} 2\lambda\beta^T \end{equation}

I was also wondering if someone could explain, with general rules, how to take these derivatives; any rules for the simplification would be greatly appreciated. I tried learning about certain properties of matrix calculus, but I can't wrap my head around which properties are applied here. I would be equally content with a resource if you could point me toward one.

Antoni Parellada
  • Is this a question from a course or textbook? If so, please add the `[self-study]` tag & read its [wiki](http://stats.stackexchange.com/tags/self-study/info). – gung - Reinstate Monica Sep 22 '16 at 16:55
  • Kind of, it's more to help me understand the "why" than get the answer. I'll add it though, thanks! – Ceyer Wakilpoor Sep 22 '16 at 17:01
  • Just think of expressions of the form ${\bf X^\top X}$ as a square form. If they were polynomial expressions, with a quadratic, $x^2$, you would have no problem with the $2$'s, right? And also, you need the chain rule to get the first expression. Think about the fact that you are differentiating with respect to $\beta$. – Antoni Parellada Sep 22 '16 at 17:05
  • Oh, that actually helps a lot, thank you. I'm still not sure how to choose between keeping the transpose or not, though. In the first equation we kept $2X^T$ in front as opposed to $2X$, but then we also keep $(Y - \beta^T X)$ and $2\lambda \beta$ – Ceyer Wakilpoor Sep 22 '16 at 17:16
  • In my experience, I recommend doing a toy example and working out the matrix math. The transposes will make sense then. For example, make Y a 2x1, beta a 2x1 and X a 2x2, then work it out (a worked numeric sketch of this appears after the comments). – ilanman Sep 22 '16 at 17:27
  • Thank you for the suggestion! Does that result in like a generalized rule, or do you suggest working that out case by case? – Ceyer Wakilpoor Sep 22 '16 at 17:29
  • It will generalize. Of course, this is sufficient for your purpose of learning how it works. On a test I would prove it more formally. – ilanman Sep 22 '16 at 17:31
  • This was more for my own understanding, I don't think we'd be tested on it, but that's helpful, I'll work it out and see if I can generalize it as a proof. – Ceyer Wakilpoor Sep 22 '16 at 17:33
  • I haven't had a chance to iron out wrinkles in my answer below (probably tonight), but I copied and pasted some notes I keep in my GitHub. – Antoni Parellada Sep 22 '16 at 18:31
  • Please clarify in the title of the question what are those equations. Also, the title is completely uninformative. Skimming the question I thought it was off-topic until I saw the tags. – Firebug Sep 22 '16 at 18:45
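
Following up on ilanman's toy-example suggestion, here is a minimal sketch (assuming Python with SymPy; the symbolic entries and the 2x2 dimensions are made up purely for illustration) that works out the toy case and compares it with the compact gradient formula. Note the minus sign that the chain rule produces when differentiating $Y - X\beta$ with respect to $\beta$:

```python
import sympy as sp

# Toy example as suggested in the comments: Y is 2x1, beta is 2x1, X is 2x2,
# all with symbolic entries.
b1, b2 = sp.symbols('b1 b2')
x11, x12, x21, x22, y1, y2 = sp.symbols('x11 x12 x21 x22 y1 y2')

X = sp.Matrix([[x11, x12], [x21, x22]])
Y = sp.Matrix([y1, y2])
beta = sp.Matrix([b1, b2])

# Residual sum of squares (Y - X beta)^T (Y - X beta); a 1x1 matrix holding the scalar.
rss = (Y - X * beta).T * (Y - X * beta)

# Gradient obtained by brute-force partial differentiation, component by component.
grad = sp.Matrix([sp.diff(rss[0], b) for b in (b1, b2)])

# Compare with the compact formula -2 X^T (Y - X beta).
diff_check = grad - (-2) * X.T * (Y - X * beta)
print([sp.simplify(d) for d in diff_check])  # [0, 0]
```

The component-wise derivatives line up exactly with the compact matrix expression, which is the generalization the comments are pointing at.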

2 Answers


Just think of expressions of the form ${\bf X^\top X}$ as a square form. If they were polynomial expressions, with a quadratic, $x^2$, you would have no problem with the $2$'s, right?

And also, you need the chain rule to get the first expression.

From my notes here:

Writing the cost function explicitly is not strictly necessary to derive the OLS estimator, but it comes into play when using regularization.

The cost function is generally expressed as:

$J(\hat \beta)= (y - {\bf X}\hat \beta)^T(y- {\bf X} \hat \beta)= \displaystyle \sum_{i=1}^n (y_i - x_i^T\hat \beta)^2= \sum_{i=1}^n(y_i - \hat y_i)^2$

Expanding the quadratic in matrix notation:

$$J(\hat \beta)= (y - {\bf X}\hat \beta)^T(y- {{\bf X} \hat \beta})= y^Ty + \color{red}{\hat \beta^T\,X^TX\,\hat \beta} - 2y^TX\hat \beta$$

The term in red is a quadratic form: the matrix ${\bf X}^T{\bf X}$ inside it is positive semidefinite (a positive definite matrix fulfills the requirement $x^TAx>0$ for all $x \neq 0$). Like the other two terms, it evaluates to a scalar.
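
A quick numeric sanity check of this expansion (a minimal sketch assuming Python/NumPy; the data are arbitrary random draws, not from any source) confirms that the three terms add back up to the original quadratic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))        # arbitrary 5x2 design matrix
y = rng.normal(size=5)             # arbitrary response vector
beta_hat = rng.normal(size=2)      # arbitrary coefficient vector

# Left-hand side: the cost function in its compact form.
lhs = (y - X @ beta_hat) @ (y - X @ beta_hat)

# Right-hand side: the expanded form y'y + b'X'Xb - 2 y'Xb.
rhs = y @ y + beta_hat @ X.T @ X @ beta_hat - 2 * (y @ X @ beta_hat)

print(np.allclose(lhs, rhs))  # True
```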

To differentiate the cost function to obtain a minimum we need two pieces of information:

$\frac{\partial {\bf A}\hat \beta}{\partial \hat \beta}={\bf A}^T$ (the derivative of a matrix-vector product with respect to the vector, in denominator layout); and $\frac{\partial \hat \beta^T{\bf A}\hat \beta}{\partial \hat \beta}= ({\bf A} + {\bf A}^T)\hat \beta = 2{\bf A}\hat \beta$ for symmetric ${\bf A}$ such as ${\bf X}^T{\bf X}$ (the derivative of a quadratic form with respect to a vector).
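
A finite-difference check of the quadratic-form rule (a minimal sketch assuming Python/NumPy, with a symmetric ${\bf A}={\bf X}^T{\bf X}$ exactly as in the cost function; the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
A = X.T @ X                    # symmetric, so (A + A^T) beta = 2 A beta
beta = rng.normal(size=3)

quad = lambda b: b @ A @ b     # the quadratic form beta^T A beta

# Central-difference approximation of the gradient, one coordinate at a time.
eps = 1e-6
grad_numeric = np.array([(quad(beta + eps * e) - quad(beta - eps * e)) / (2 * eps)
                         for e in np.eye(3)])

print(np.allclose(grad_numeric, 2 * A @ beta))  # True
```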

$$\frac{\partial J(\hat \beta)}{\partial \hat \beta}=\frac{\partial}{\partial\hat \beta}\left[y^Ty + \color{red}{\hat \beta^T\,X^TX\,\hat \beta} - 2y^TX\hat \beta \right]=0 +2 \color{red}{X^TX\,\hat \beta}-2X^Ty$$

which gives:

$$2X^TX\hat \beta = 2X^Ty$$

$$\hat \beta = (X^TX)^{-1}X^Ty$$
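
As a final check, the closed-form estimate agrees with a generic least-squares solver (a minimal sketch with simulated data, assuming Python/NumPy; the true coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Closed form from the normal equations derived above.
beta_closed_form = np.linalg.inv(X.T @ X) @ X.T @ y

# NumPy's generic least-squares routine for comparison.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_closed_form, beta_lstsq))  # True
```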

Antoni Parellada

Two key points to help clear confusion:

  1. Be careful whether they're differentiating with respect to a row vector or a column vector. (The two are essentially equivalent, but the resulting formulas are transposes of each other.)
  2. There are tons of different notations/conventions for the gradient or partial derivatives ($\nabla f$, $\frac{\partial f}{\partial \mathbf{x}'}$, $f_x$, etc.)

People can also get a bit sloppy, and sometimes a transpose is missing for a formula here or there.

Let $f(\mathbf{x})$ be a function from $\mathbb{R}^n \rightarrow \mathbb{R}$. There are two basic ways to write the gradient.

Numerator layout (i.e. result is a row vector, notice I wrote $\mathbf{x}'$): $$ \frac{\partial f}{\partial \mathbf{x}'} = \left[ \begin{array}{cccc} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \ldots & \frac{\partial f}{\partial x_n} \end{array} \right] $$ Denominator layout (i.e. result is a column vector): $$ \frac{\partial f}{\partial \mathbf{x}} = \left[ \begin{array}{c} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{array} \right] $$ Consequently:

$$ \frac{\partial \left( \boldsymbol{\beta}'\boldsymbol{\beta}\right) }{\partial \boldsymbol{\beta}'} = 2\boldsymbol{\beta}' $$ And $$ \frac{\partial \left( \boldsymbol{\beta}'\boldsymbol{\beta}\right) }{\partial \boldsymbol{\beta}} = 2\boldsymbol{\beta} $$

And if you get confused, feel free to write things out! E.g. $ \boldsymbol{\beta}'\boldsymbol{\beta} = \sum_i \beta_i^2 $, and it should be fairly straightforward to compute the gradient of that. A perfectly legitimate activity is to write things out explicitly by hand, figure out afterwards how to write them compactly using matrices, secretly destroy your lengthy derivation, publish only the compact matrix notation, and pretend you're the absolute master of vector and matrix calculus identities.
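
If you want to "write things out" programmatically, here is a minimal sketch (assuming Python/NumPy; the vector is arbitrary) comparing the component-wise derivatives of $\sum_i \beta_i^2$ with the compact formula $2\boldsymbol{\beta}$:

```python
import numpy as np

beta = np.array([0.3, -1.2, 2.0])      # any vector will do

f = lambda b: np.sum(b ** 2)           # beta' beta written out as a sum of squares

# Partial derivatives by central differences, one component at a time.
eps = 1e-6
partials = np.array([(f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
                     for e in np.eye(beta.size)])

print(np.allclose(partials, 2 * beta))  # True
# Stacked as a column these partials give the denominator-layout result 2*beta;
# stacked as a row they give the numerator-layout result 2*beta'.
```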

You may also find this answer helpful to check your work going through the matrix algebra and matrix calculus: Understanding linear algebra in Ordinary Least Squares derivation

Matthew Gunn