
In Andrew Ng's machine learning course, he uses this formula:

$\nabla_A tr(ABA^TC) = CAB + C^TAB^T$

and he does a quick proof which is shown below:

$$\eqalign{ \nabla_A tr(ABA^TC) &= \nabla_A tr(f(A)A^TC) \\ &= \nabla_{\circ} tr(f(\circ)A^TC) + \nabla_{\circ}tr(f(A)\circ^T C)\\ &= (A^TC)^Tf'(\circ) + \left(\nabla_{\circ^T}tr(f(A)\circ^T C)\right)^T \\ &= C^TAB^T + \left(\nabla_{\circ^T}tr(\circ^T Cf(A))\right)^T \\ &= C^TAB^T + ((Cf(A))^T)^T \\ &= C^TAB^T + CAB }$$

The proof seems very dense, without any comments, and I'm having trouble understanding it. What exactly happened from the second to the third equality?
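For what it's worth, the identity itself can be sanity-checked numerically. Here is a minimal sketch using numpy; the shapes $m=3$, $n=4$ are arbitrary, chosen to satisfy the dimension constraints discussed in the comments below.

```python
import numpy as np

# Minimal sanity check of grad_A tr(A B A^T C) = C A B + C^T A B^T
# via central finite differences. Shapes: A is m x n, B is n x n, C is m x m.
rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m))

f = lambda M: np.trace(M @ B @ M.T @ C)

eps = 1e-6
num_grad = np.zeros_like(A)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

analytic = C @ A @ B + C.T @ A @ B.T
print(np.allclose(num_grad, analytic, atol=1e-6))  # True
```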

MoneyBall
  • He must be making special assumptions about the dimensions of $A$, $B$, and $C$, for otherwise this formula makes no sense in general. On the left hand side $A$ must be an $i\times j$ matrix, $B$ a $j\times j$ matrix, and $C$ an $i\times m$ matrix for arbitrary non-negative integers $i,j,m$. But then the products on the right would not be defined unless $i=m$. – whuber Jan 22 '17 at 17:07
  • @whuber I see. Given the assumptions, I still don't understand how the transition happened from second to third line where he introduces $\circ$. – MoneyBall Jan 22 '17 at 17:18
  • Between the first and second lines he's let $f(A)=AB$. Between the second and third lines he's used the product rule. Later he uses the chain rule to get rid of $f()$. – Brian Borchers Jan 22 '17 at 23:37

1 Answer


There is a subtle but heavy abuse of the notation that renders many of the steps confusing. Let's address this issue by going back to the definitions of matrix multiplication, transposition, traces, and derivatives. For those wishing to omit the explanations, just jump to the last section "Putting It All Together" to see how short and simple a rigorous demonstration can be.


Notation and Concepts

Dimensions

For the expression $ABA^\prime C$ to make sense when $A$ is an $m\times n$ matrix, $B$ must be a (square) $n\times n$ matrix and $C$ must be an $m\times p$ matrix, whence the product is an $m\times p$ matrix. In order to take the trace (which is the sum of diagonal elements, $\operatorname{Tr}(X)=\sum_i X_{ii}$), we need $p=m$, making $C$ a square matrix.
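A quick numpy sketch of these shape constraints (the values $m=3$, $n=4$ are arbitrary):

```python
import numpy as np

# A is m x n, B is n x n, C is m x m, so A B A' C is m x m and its trace exists.
m, n = 3, 4
A = np.ones((m, n))
B = np.ones((n, n))
C = np.ones((m, m))
product = A @ B @ A.T @ C
print(product.shape)       # (3, 3): square, so the trace is defined
print(np.trace(product))
```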

Derivatives

The notation "$\nabla_A$" appears to refer to the derivative of an expression with respect to $A$. Ordinarily, differentiation is an operation performed on functions $f:\mathbb{R}^N\to\mathbb{R}^M$. The derivative at a point $x\in \mathbb{R}^N$ is a linear transformation $Df(x):\mathbb{R}^N\to\mathbb{R}^M$. Upon choosing bases for these vector spaces, such a transformation can be represented as an $M\times N$ matrix. That is not the case here!

Matrices as vectors

Instead, $A$ is being considered as an element of $\mathbb{R}^{mn}$: its coefficients are being unrolled (usually either row by row or column by column) into a vector of length $N=mn$. The function $f(A)=\operatorname{Tr}(ABA^\prime C)$ has real values, whence $M=1$. Consequently, $Df(x)$ must be a $1\times mn$ matrix: it's a row vector representing a linear form on $\mathbb{R}^{mn}$. However, the calculations in the question use a different way of representing linear forms: their coefficients are rolled back up into $m\times n$ matrices.
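To make the unrolling concrete, here is a small sketch (assuming row-by-row unrolling, which is what numpy's `ravel` does by default):

```python
import numpy as np

# An m x n matrix viewed as an element of R^{mn}: unroll row by row.
# A linear form on this space is a length-mn (row) vector; its coefficients
# can be rolled back up into an m x n matrix of the same shape as A.
m, n = 2, 3
A = np.arange(float(m * n)).reshape(m, n)
v = A.ravel()                              # unrolled: shape (mn,) = (6,)
print(v.shape)                             # (6,)
print(np.array_equal(v.reshape(m, n), A))  # True: rolling back up recovers A
```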

The trace as a linear form

Let $\omega$ be a constant $m\times n$ matrix. Then, by definition of the trace and of matrix multiplication,

$$\eqalign{ \operatorname{Tr}(A\omega^\prime) &= \sum_{i=1}^m(A\omega^\prime)_{ii} = \sum_{i=1}^m\left(\sum_{j=1}^n A_{ij}(\omega^\prime)_{ji}\right) = \sum_{i,j} \omega_{ij}A_{ij} }$$

This expresses the most general possible linear combination of the coefficients of $A$: $\omega$ is a matrix of the same shape as $A$ and its coefficient in row $i$ and column $j$ is the coefficient of $A_{ij}$ in the linear combination. Because $\omega_{ij}A_{ij}=A_{ij}\omega_{ij}$, the roles of $\omega$ and $A$ may be switched, giving the equivalent expression

$$\sum_{i,j} \omega_{ij}A_{ij} = \operatorname{Tr}(A\omega^\prime) = \operatorname{Tr}(\omega A^\prime).\tag{1}$$

By identifying a constant matrix $\omega$ with either of the functions $A\to \operatorname{Tr}(A \omega^\prime)$ or $A\to \operatorname{Tr}(\omega A^\prime)$, we may represent linear forms on the space of $m\times n$ matrices as $m\times n$ matrices. (Do not confuse these with derivatives of functions from $\mathbb{R}^n$ to $\mathbb{R}^m$!)
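A numerical check of the pairing $(1)$ (a sketch with arbitrary shapes):

```python
import numpy as np

# Check of (1): Tr(A w') = sum_{i,j} w_ij A_ij = Tr(w A').
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
w = rng.standard_normal((3, 4))
print(np.isclose(np.trace(A @ w.T), np.sum(w * A)))      # True
print(np.isclose(np.trace(A @ w.T), np.trace(w @ A.T)))  # True
```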


Computing a Derivative

The definition

Derivatives of many of the matrix functions encountered in statistics are most easily and reliably computed from the definition: you don't really need to resort to complicated rules of matrix differentiation. This definition says that $f$ is differentiable at $x$ if and only if there is a linear transformation $L$ such that

$$f(x+h) - f(x) = Lh + o(|h|)$$

for arbitrarily small displacements $h\in \mathbb{R}^N$. The little-oh notation means that the error made in approximating the difference $f(x+h)-f(x)$ by $Lh$ is arbitrarily smaller than the size of $h$ for sufficiently small $h$. In particular, we may always ignore errors that are proportional to $|h|^2$.

The calculation

Let's apply the definition to the function in question. Multiplying, expanding, and ignoring the term with a product of two $h$'s in it,

$$\eqalign{ f(A+h)-f(A) &= \operatorname{Tr}((A+h)B(A+h)^\prime C) - \operatorname{Tr}(ABA^\prime C) \\ &= \operatorname{Tr}(hBA^\prime C) +\operatorname{Tr}(ABh^\prime C) + o(|h|).\tag{2} }$$
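(A parenthetical numeric check: the dropped term is $\operatorname{Tr}(hBh^\prime C)$, which is quadratic in $h$, so scaling $h$ by $t$ should scale the error of the linear approximation by $t^2$. A sketch with arbitrary shapes:)

```python
import numpy as np

# The remainder in (2) is Tr(h B h' C), which scales like |h|^2.
rng = np.random.default_rng(2)
m, n = 3, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m))
H = rng.standard_normal((m, n))

f = lambda M: np.trace(M @ B @ M.T @ C)
lin = lambda h: np.trace(h @ B @ A.T @ C) + np.trace(A @ B @ h.T @ C)

for t in (1e-1, 1e-2, 1e-3):
    print(t, f(A + t * H) - f(A) - lin(t * H))  # errors shrink like t**2
```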

To identify the derivative $L=Df(A)$, we must put the expansion $(2)$ into the form $(1)$. The first term on the right is already in this form, with $\omega^\prime = BA^\prime C$. The other term on the right has the form $\operatorname{Tr}(Xh^\prime C)$ for $X=AB$. Let's write this out:

$$\operatorname{Tr}(Xh^\prime C) = \sum_{i=1}^m\sum_{j=1}^n\sum_{k=1}^m X_{ij} h_{kj} C_{ki} = \sum_{i,j,k}h_{kj} \left(C_{ki}X_{ij}\right) =\operatorname{Tr}((CX)h^\prime).\tag{3}$$
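(Numerically, with arbitrary compatible shapes, this identity checks out:)

```python
import numpy as np

# Check of (3): Tr(X h' C) = Tr((C X) h').
rng = np.random.default_rng(3)
m, n = 3, 4
X = rng.standard_normal((m, n))
h = rng.standard_normal((m, n))
C = rng.standard_normal((m, m))
print(np.isclose(np.trace(X @ h.T @ C), np.trace((C @ X) @ h.T)))  # True
```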

Recalling $X=AB$, $(2)$ can be rewritten

$$f(A+h) - f(A) = \operatorname{Tr}(h\, BA^\prime C\,) + \operatorname{Tr}(CAB\, h^\prime\,)+o(|h|).$$

It is in this sense that we may consider the derivative of $f$ at $A$ to be $$Df(A) = (BA^\prime C)^\prime + CAB = C^\prime A B^\prime + CAB,$$ because these matrices play the roles of $\omega$ in the trace formulas $(1)$.


Putting It All Together

Here, then, is a complete solution.

Let $A$ be an $m\times n$ matrix, $B$ an $n\times n$ matrix, and $C$ an $m\times m$ matrix. Let $f(A) = \operatorname{Tr}(ABA^\prime C)$. Let $h$ be an $m\times n$ matrix with arbitrarily small coefficients. Because (by identity $(3)$) $$\eqalign{f(A+h) - f(A) &= \operatorname{Tr}(hBA^\prime C) +\operatorname{Tr}(ABh^\prime C) + o(|h|) \\ &=\operatorname{Tr}(h(C^\prime A B^\prime)^\prime + (CAB)h^\prime) + o(|h|),}$$ $f$ is differentiable and its derivative is the linear form determined by the matrix $$C^\prime A B^\prime + CAB.$$

Because this takes only about half the work and involves only the most basic manipulations of matrices and traces (multiplication and transposition), it has to be considered a simpler--and arguably more perspicuous--demonstration of the result. If you really want to understand the individual steps in the original demonstration, you might find it fruitful to compare them to the calculations shown here.
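Finally, for readers who like to verify such results numerically, here is a minimal end-to-end sketch (the shapes and the step size $10^{-5}$ are arbitrary choices): it confirms that the matrix $C^\prime AB^\prime + CAB$ reproduces the increment of $f$ through the trace pairing $(1)$.

```python
import numpy as np

# With w = C'AB' + CAB, the increment f(A+h) - f(A) equals Tr(h w')
# up to an O(|h|^2) remainder, i.e. w represents Df(A) in the sense of (1).
rng = np.random.default_rng(4)
m, n = 3, 4
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m))
h = 1e-5 * rng.standard_normal((m, n))

f = lambda M: np.trace(M @ B @ M.T @ C)
w = C.T @ A @ B.T + C @ A @ B
print(np.isclose(f(A + h) - f(A), np.trace(h @ w.T)))  # True
```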

whuber
  • It's helpful to know that in general, $\mbox{tr}(ABC)=\mbox{tr}(CAB)$ whenever the matrices are of compatible sizes. Knowing this makes (3) a trivial step. – Brian Borchers Jan 22 '17 at 23:34
  • @whuber Thank you for the detailed answer. It makes a lot more sense to me now. A couple of things I want to ask - 1) in $f(A+h)−f(h)=Tr(hBA′C)+Tr(CABh′)+o(|h|)$, it seems like a typo - $f(h)$->$f(A)$? 2) everything up to $f(A+h)−f(A)=Tr(hBA′C)+Tr(CABh′)+o(|h|)$ makes sense, but I'm not too sure how you went from that to $Df(A)=(BA′C)′+CAB=C′AB′+CAB$. How exactly did you cancel the $tr()$ operator? – MoneyBall Jan 23 '17 at 07:10
  • @Brian Yes, that's true. From that point of view the entire problem is "trivial": just invoke a powerful enough identity to get from the left side to the right. This answer was intended to be elementary and self-contained, and so it intentionally does not appeal to anything that goes beyond a definition. The proof of that general statement about traces is similar to the calculation offered in $(3)$. – whuber Jan 23 '17 at 15:06
  • Moneyball, (1) yes, that's a typo: thank you for catching it. (2) Nothing was "canceled": the point of the section "The Trace as a Linear Form" is that in this context the *meaning* of a matrix *qua* derivative of a function of matrices has to be understood in the sense that it's part of a trace formula. In other words, when we say the derivative of $f$ is such-and-such a matrix $\omega$, what we mean is that the derivative is the linear form $X\to\operatorname{Tr}(X\omega{\,^\prime})$. – whuber Jan 23 '17 at 15:09
  • +1 but it looks like there is a gap in your exposition: I cannot find any mention of what $\partial_A \mathrm{Tr}(A\omega^\top)$ is. You say that to find the derivative of the main expression we must get it into form (1), but you never said what's the derivative of (1). – amoeba Jan 23 '17 at 22:16
  • @Amoeba I can't tell whether you are trying to be humorous or not. Neither the question nor the answer has anything directly to do with partial derivatives. The form $(1)$ explicitly is a linear form defined on the vector space $\operatorname{Mat}(m,n)$ of $m\times n$ real matrices. When somebody claims that the derivative of a function $f:\operatorname{Mat}(m,n)\to\mathbb{R}$ at a point $A$ equals some matrix $\omega$, what they mean is that $Df(A)$ is the linear form given by $X\mapsto\operatorname{Tr}(X\omega^{\,\prime})$. – whuber Jan 23 '17 at 22:29
  • @whuber I understand. Your entire derivation is completely clear to me, I don't have any questions myself. I was just observing that I can imagine a person such as OP reading your answer, arriving to (1), seeing your expression for $\mathrm{Tr}(X\omega')$ and wondering *Okay, but what is its derivative with respect to X?* In your last comment you are saying that it's obviously $\omega$, but it might not be obvious to all, or at least that was my point. – amoeba Jan 23 '17 at 22:38
  • @whuber I did some research and kind of figured it out. There is a different derivative equation that someone used to show how the derivative of a trace is computed - $D_Yf(X) = \lim_{t\to 0} \frac{f(X+tY) - f(X)}{t} = tr(Y^TU)$. I asked a separate question on why this works. Anyway, thank you for your comprehensive answer. – MoneyBall Jan 24 '17 at 04:30
  • @Amoeba In my preceding comment I really did mean $Df(A)$. After all, $f$ is any differentiable function, not necessarily linear, while $Df(A)$ is a linear transformation approximating $f$ in a neighborhood of $A$. Moneyball: that limit formula is exactly the same approach I used, but it is freighted with the apparatus of limits, which is not really needed. The source for the approach I laid out here, as well as the notation, is Michael Spivak's *Calculus on Manifolds*, Chapter 2. Of particular importance for statistics is the *algebraic* nature of the calculations. – whuber Jan 24 '17 at 15:29
  • Ah yes, of course, thanks. Upon re-reading your answer now, I can see how everything fits together, even though I still feel that "because these matrices play the roles of $\omega$ in the trace formulas (1)" could use some additional explanation. But together with the discussion in the comments it should be clear enough. – amoeba Jan 24 '17 at 19:31
  • Another nitpick (or perhaps another misunderstanding of mine?): when you write that $Df(A) = (BA^\prime C)^\prime + CAB = C^\prime A B^\prime + CAB$, it is a bit of an abuse of notation, right? Because for $f(A)=\mathrm{Tr}(A\omega')$ we have $Df(A)=f(A)=\mathrm{Tr}(A\omega')$, not $\omega$ itself. – amoeba Jan 24 '17 at 19:33
  • @Amoeba That's exactly right--it amply justifies the assertions in the first line of this answer. It is why I wrote "in *this* sense" and, later in the summary, used the phrase "determined by" rather than "equals." I won't deny that the explanation has been challenging; I'll think about how to clarify it and I appreciate all your comments and suggestions. – whuber Jan 24 '17 at 19:35
  • Do you know of any statistics book (where stuff like linear regression and so on is explained - so not a graduate probability book) where everything is explained as precisely as you did here? – l7ll7 Jan 25 '17 at 16:37
  • @user10324 Christensen's *Plane Answers to Complex Questions* is precise and widely appreciated. It uses a traditional tensor-style, "debauch of indices" notation. For a "coordinate-free" approach, which is closer to the ideas underlying my answer here, Christensen recommends Eaton, M. L. (1983), *Multivariate Statistics: A Vector Space Approach* (John Wiley and Sons, New York; reprinted in 2007 by IMS Lecture Notes – Monograph Series). I have not seen Eaton's book. – whuber Jan 25 '17 at 19:03
  • Christensen's book is superbly precise, thanks a lot. *But*, even though it's more precise than any other book I've read, I've come to believe your answer http://stats.stackexchange.com/questions/246047/independent-variable-random-variable is the gold standard for precision. Unfortunately, Christensen's book does not make the distinctions you make about how, e.g., a linear model can be understood from the perspective of probability theory. Do you know of any even more precise book, one that would come closer to this gold standard? – l7ll7 Jan 25 '17 at 21:57
  • @user10324 I'm not sufficiently familiar with the literature to help you there. You might have better luck looking in the research journals rather than in books. – whuber Jan 25 '17 at 23:03
  • Fair enough. But the mathematical precision in your answers here on stats.SX - does that come simply from years of practicing statistics, from figuring out the precise maths behind these things for yourself, or did you have any superbly precise texts (books, lecture notes, etc.) that set you on the road in the beginning? As an undergraduate student in this field (but with a bachelor's degree in pure maths) I am wondering what the road is to deeply understanding statistics from a mathematical point of view (and I apologize if the question is too personal). – l7ll7 Jan 26 '17 at 14:32
  • @user10324 Most of what I post on this site is my own formulation--I rarely consult sources (and I document them when I do). These posts are distillations from reading many books and papers. Some of the best books have not been those that are completely mathematically rigorous, but which have beautifully explained and illustrated the underlying ideas. The first few that come to mind--in order of sophistication--are Freedman, Pisani, & Purves, *Statistics* (any edition); Jack Kiefer, *Introduction to Statistical Inference*; and Steven Shreve, *Stochastic Calculus for Finance II*. – whuber Jan 26 '17 at 16:23
  • @whuber I finally have some grasp of what the linear form of the trace is. I apologize for asking the same question again on separate posts when I could've read your explanation more carefully. I do have one more question. If your equation $f(x+h)−f(x)=Lh+o(|h|)$ can be applied to find derivatives of any matrix function, does $h$ have the same dimension as $x$? So if $x \in \mathbb{R}^{m \times n}$, then $h \in \mathbb{R}^{m \times n}$? – MoneyBall Jan 31 '17 at 13:19
  • Moneyball, the expression "$x+h$" is defined only when $x$ and $h$ are elements of the same vector space--which implies your presumption is correct. – whuber Jan 31 '17 at 14:34