By the chain rule,
$\frac{\partial x^{T}A^{T}y}{\partial \Sigma_{i,j}}=
\mbox{tr} \left( \left( \frac{\partial x^{T}A^{T}y}{\partial A^{T}} \right)^{T} \frac{\partial A^{T}}{\partial \Sigma_{i,j} } \right) $.
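Written out elementwise, this is just the ordinary multivariate chain rule,
$\frac{\partial x^{T}A^{T}y}{\partial \Sigma_{i,j}}=\sum_{k,l} \frac{\partial x^{T}A^{T}y}{\partial (A^{T})_{k,l}} \, \frac{\partial (A^{T})_{k,l}}{\partial \Sigma_{i,j}}$,
since $\mbox{tr}(B^{T}C)=\sum_{k,l}B_{k,l}C_{k,l}$.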
This chain rule formulation is described in many references on matrix calculus, such as The Matrix Cookbook by Petersen and Pedersen.
The first partial derivative is easy:
$ \frac{\partial x^{T}A^{T}y}{\partial A^{T}}=xy^{T}$.
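To see this, write $x^{T}A^{T}y=\sum_{k,l}x_{k}(A^{T})_{k,l}\,y_{l}$, so the derivative with respect to $(A^{T})_{k,l}$ is $x_{k}y_{l}$, which is the $(k,l)$ entry of $xy^{T}$.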
You can find a useful formula for the derivative of the Cholesky factor with respect to elements of $\Sigma$ on page 211 of Bayesian Filtering and Smoothing by Simo Särkkä.
(Note that the book uses $P=AA^{T}$ rather than $\Sigma=A^{T}A$, so the notation doesn't match directly; I've transposed everything from the book to match the notation used in your statement of the problem.) After the change of notation, this formula gives:
$\frac{\partial A^{T}}{\partial \Sigma_{i,j}}=A^{T} \Phi \left(A^{-T} E_{i,j} A^{-1} \right) $
where $\Phi_{k,l}(M)=M_{k,l}$ if $k>l$, $\Phi_{k,l}(M)=M_{k,l}/2$ if $k=l$, and $\Phi_{k,l}(M)=0$ if $k<l$; that is, $\Phi(M)$ is the lower triangle of $M$ with the diagonal halved. $E_{i,j}$ is the zero matrix with ones in the $(i,j)$ and $(j,i)$ positions (a single 1 when $i=j$).
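In case it's useful, here is a short NumPy sketch of $\Phi$ and of this derivative (the names `Phi` and `dAT_dSigma` are just my own labels, and I use explicit inverses for clarity rather than triangular solves):

```python
import numpy as np

def Phi(M):
    """Lower triangle of M, with the diagonal entries halved and zeros above."""
    out = np.tril(M).astype(float)
    out[np.diag_indices_from(out)] /= 2.0
    return out

def dAT_dSigma(A, i, j):
    """dA^T/dSigma_{i,j} for Sigma = A^T A, with A upper triangular
    (so A^T is the lower-triangular Cholesky factor of Sigma)."""
    n = A.shape[0]
    E = np.zeros((n, n))
    E[i, j] = 1.0
    E[j, i] = 1.0                         # single 1 on the diagonal when i == j
    AinvT = np.linalg.inv(A.T)            # A^{-T}
    return A.T @ Phi(AinvT @ E @ AinvT.T) # A^T * Phi(A^{-T} E_{i,j} A^{-1})
```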
I can't see any particular way to simplify this further.
I have tested this in MATLAB by comparing the formula against a finite-difference approximation, and the results match up.
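For what it's worth, here is the same kind of finite-difference check sketched in NumPy rather than MATLAB, reusing `Phi` and `dAT_dSigma` from the sketch above (`grad_entry` and the random SPD test matrix are my own choices). It assembles the trace formula and compares each entry against a central difference:

```python
def grad_entry(Sigma, x, y, i, j):
    """Analytic d(x^T A^T y)/dSigma_{i,j} via the trace chain rule."""
    AT = np.linalg.cholesky(Sigma)        # lower triangular, Sigma = AT AT^T, i.e. AT = A^T
    dAT = dAT_dSigma(AT.T, i, j)
    return np.trace((x @ y.T).T @ dAT)    # tr((xy^T)^T dA^T/dSigma_{i,j})

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
Sigma = B @ B.T + n * np.eye(n)           # random SPD test matrix
x = rng.standard_normal((n, 1))
y = rng.standard_normal((n, 1))
f = lambda S: (x.T @ np.linalg.cholesky(S) @ y).item()   # x^T A^T y

h = 1e-6
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = 1.0
        E[j, i] = 1.0
        fd = (f(Sigma + h * E) - f(Sigma - h * E)) / (2 * h)
        assert np.isclose(grad_entry(Sigma, x, y, i, j), fd, atol=1e-5)
```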