
How do I differentiate $$ (y-X\beta)^T(y - X \beta) $$

with respect to $\beta$? The result I saw was

$$X^T(y - X\beta)$$

EA Lehn
  • What have you tried so far? Where are you stuck? – Sycorax Sep 02 '20 at 18:06
  • @Sycorax I was trying to use the product rule. I don't know how to differentiate the first part with the exponent T. – EA Lehn Sep 02 '20 at 18:08
  • Are you sure $^T$ is an exponent, instead of notation for transpose? Typically, this expression, with these symbols, arises in a regression setting, and is the dot-product of the error $y - X\beta$ with itself, i.e. square error. – Sycorax Sep 02 '20 at 18:10
  • @Sycorax Yes, it is a transpose. Or should I differentiate it using this: $(y - X\beta)^2$? – EA Lehn Sep 02 '20 at 18:14
  • @EALehn As Sycorax pointed out, this problem is written in matrix notation, which has its own calculus/algebra rules: you square a term by multiplying it by its transpose, since the number of columns of the first matrix must equal the number of rows of the second. The solution to the problem is $\beta = (X'X)^{-1}X'y$, which is the normal equation for a linear regression. – Tylerr Sep 02 '20 at 18:18
  • @Tylerr "Solution" in what sense? What you've written is not the derivative of the expression with respect to $\beta$. – Sycorax Sep 02 '20 at 18:25
  • @Tylerr Okay. But how did you get the final solution? – EA Lehn Sep 02 '20 at 18:25
  • @Sycorax This is a regression problem: we differentiate the squared error with respect to beta, set it to 0, and solve for beta. So the first expression is the squared error. After differentiating, setting to 0, and solving for beta, we get the normal equation: https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/ – Tylerr Sep 02 '20 at 18:40
  • @Tylerr That seems to be the answer to a question that OP didn't ask. – Sycorax Sep 02 '20 at 18:52
  • @EA Lehn Here is an OK walkthrough (definitely look up matrix calculus and algebra rules to follow along): https://towardsdatascience.com/normal-equation-a-matrix-approach-to-linear-regression-4162ee170243. It is actually pretty easy once you grasp the matrix rules: just expand your first equation, take the derivative, and solve for beta (a worked version of this expansion follows these comments). I couldn't find a reference that goes through every single step individually, although I am sure one exists. Let me know if you need more help. – Tylerr Sep 02 '20 at 18:53
  • @Tylerr I get it now. I went through https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression/ – EA Lehn Sep 02 '20 at 18:55
  • I'm with @Sycorax: that's a completely different problem to solve. – Learning stats by example Sep 02 '20 at 19:19
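
For completeness, here is a worked version of the expansion route suggested in the comments (the rules used are standard matrix-calculus identities). Expanding the quadratic form, and using that the scalar $y^\top X\beta$ equals its own transpose $\beta^\top X^\top y$,

$$(y - X\beta)^\top(y - X\beta) = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta.$$

Applying $\frac{\partial}{\partial \beta} \beta^\top a = a$ and $\frac{\partial}{\partial \beta} \beta^\top A \beta = 2A\beta$ (for symmetric $A$) gives

$$\frac{\partial}{\partial \beta} (y - X\beta)^\top(y - X\beta) = -2X^\top y + 2X^\top X \beta = -2X^\top(y - X\beta),$$

and setting this to zero yields the normal equation $\hat\beta = (X^\top X)^{-1} X^\top y$, assuming $X^\top X$ is invertible.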

1 Answer


Let us assume that you are working in a setup where $y$ is $N \times 1$, $X$ is $N \times K$, and $\beta$ is $K \times 1$. I prefer to define $e(\beta) := y - X\beta$ and similarly the $i$'th component $e_{i}(\beta) = (y - X\beta)_i = y_i - x_i^\top\beta$, where $x_i^\top$ is the $i$'th row of $X$. You should then be able to convince yourself that

$$e(\beta)^\top e(\beta) = \sum_i e_{i}(\beta) e_{i}(\beta),$$

the sum of squared deviations. Now I guess you know how to differentiate with respect to a single variable (read: parameter) $\beta_k$, so let's try this:

$$\frac{\partial}{\partial \beta_k} e(\beta)^\top e(\beta) = \sum_i\frac{\partial}{\partial \beta_k} [e_{i}(\beta) e_{i}(\beta)],$$

Apply the product rule to get

$$= \sum_i \left[\frac{\partial e_i(\beta)}{\partial \beta_k} e_i(\beta) + e_i(\beta) \frac{\partial e_i(\beta)}{\partial \beta_k}\right] = 2 \sum_i \frac{\partial e_i(\beta)}{\partial \beta_k} e_i(\beta),$$

where the final sum here can be written in matrix/vector notation as

$$= 2 \left[\frac{\partial e(\beta)^\top}{\partial \beta_k}\right] e(\beta),$$

All the same derivations can be done differentiating with respect to the column vector $\beta$, observing the rule that when you differentiate with respect to a column you get a column, so

$$\frac{\partial e_i(\beta)}{\partial \beta} = \begin{pmatrix} \frac{\partial e_i(\beta)}{\partial \beta_1}\\ \vdots \\ \frac{\partial e_i(\beta)}{\partial \beta_K} \end{pmatrix}$$

You should then be able to arrive at the rule that

$$\frac{\partial}{\partial \beta} e(\beta)^\top e(\beta) = 2 \left[\frac{\partial e(\beta)^\top}{\partial \beta}\right] e(\beta),$$

So let us figure out what $\frac{\partial e(\beta)^\top}{\partial \beta}$ is, for which we get

$$\frac{\partial e(\beta)^\top}{\partial \beta} = \frac{\partial}{\partial \beta} (e_1(\beta),\ldots,e_N(\beta)) = \left( \frac{\partial e_1(\beta)}{\partial \beta},\ldots, \frac{\partial e_N(\beta)}{\partial \beta}\right),$$

and for each $i$ you have $\frac{\partial e_{i}(\beta)}{\partial \beta} = -x_i$, so it is easy to see that

$$\frac{\partial e(\beta)^\top}{\partial \beta} = - X^\top,$$

and it follows that

$$\frac{\partial}{\partial \beta} e(\beta)^\top e(\beta) = - 2X^\top (y - X\beta).$$
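
If you want to sanity-check this result numerically, here is a minimal NumPy sketch comparing the analytic gradient $-2X^\top(y - X\beta)$ against central finite differences; the dimensions, random data, and step size are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 3                      # illustrative sizes
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
beta = rng.normal(size=K)

def sse(b):
    """Sum of squared deviations e(b)'e(b)."""
    e = y - X @ b
    return e @ e

# Analytic gradient derived above: -2 X'(y - X beta)
analytic = -2 * X.T @ (y - X @ beta)

# Central finite differences, one coordinate beta_k at a time
h = 1e-6
numeric = np.array([
    (sse(beta + h * np.eye(K)[k]) - sse(beta - h * np.eye(K)[k])) / (2 * h)
    for k in range(K)
])

print(np.allclose(analytic, numeric, atol=1e-4))  # expect: True
```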

In a context where the writer knows they want to solve $- 2X^\top (y - X\beta) = 0$, they may go directly from $$\frac{\partial}{\partial \beta} e(\beta)^\top e(\beta) = 0$$ to $X^\top (y - X\beta) = 0$, which can lead you to think that the author is implicitly claiming that $$\frac{\partial}{\partial \beta} e(\beta)^\top e(\beta) = X^\top (y - X\beta),$$ which is not the case.
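
Tying this back to the comment thread: setting the gradient to zero gives the normal equations $X^\top X \beta = X^\top y$. A minimal sketch of solving them, assuming $X^\top X$ is invertible (the data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 8, 3                      # illustrative sizes
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

# Solve the normal equations X'X beta = X'y (assumes X'X is invertible)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At beta_hat the gradient -2 X'(y - X beta) vanishes
grad = -2 * X.T @ (y - X @ beta_hat)
print(np.allclose(grad, 0, atol=1e-10))  # expect: True
```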

Jesper for President