
Say I have a multivariate normal $N(\mu, \Sigma)$ density. I want to get the second (partial) derivative w.r.t. $\mu$. I am not sure how to take the derivative of a matrix.

Wikipedia says to take the derivative element by element inside the matrix.

I am working with the Laplace approximation $$\log P_{N}(\theta)=\log P_{N}-\frac{1}{2}(\theta-\hat{\theta})^{T}\Sigma^{-1}(\theta-\hat{\theta}).$$
The mode is $\hat\theta=\mu$.

I was given $${\Sigma}^{-1}=-\frac{{{\partial }^{2}}}{\partial {{\theta }^{2}}}\log p(\hat{\theta }|y),$$ how did this come about?

What I have done:
$$\log P(\theta|y) = -\frac{k}{2} \log 2 \pi - \frac{1}{2} \log \left| \Sigma \right| - \frac{1}{2} {(\theta-\hat \theta)}^{T}{\Sigma}^{-1}(\theta-\hat\theta)$$

So I take the derivative w.r.t. $\theta$. First, there is a transpose; second, it is a matrix expression. So I am stuck.
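For reference, the identity needed for the quadratic term (one of the standard quadratic-form rules, e.g. in the Matrix Cookbook) is, for a constant matrix $A$,
$$\frac{\partial}{\partial \theta}(\theta-a)^{T}A(\theta-a)=(A+A^{T})(\theta-a),$$
which reduces to $2A(\theta-a)$ when $A$ is symmetric, so the transpose takes care of itself.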

Note: If my professor comes across this, I am referring to the lecture.

user1061210
  • part of your problem may be that your expression for the log-likelihood has an error - you have $|\Sigma|$ where you should have $\log(|\Sigma|)$. Also, by any chance did you mean ${\Sigma}^{-1}=-\frac{{{\partial }^{2}}}{\partial {{\theta }^{2}}}\log p(\theta|y)$? – Macro May 01 '12 at 12:19
  • Yes, you are right, sorry. Why is there a negative sign in front of the partial derivative? – user1061210 May 01 '12 at 14:38
  • I was just clarifying about the negative sign because the negative second derivative is the observed Fisher information, which is usually of interest. Also, by my own calculation, I'm finding that $\frac{{{\partial }^{2}}}{\partial {{\theta }^{2}}}\log p(\theta|y) = -\Sigma^{-1}$ – Macro May 01 '12 at 14:48
  • So, what is the general procedure for discrete/continuous function? Take log, write in Taylor expansion form, differentiate twice w.r.t. $\theta$. Fisher info is not generally true most other densities, right? – user1061210 May 01 '12 at 15:19
  • I also did a Beta example; it seems like $\Sigma^{-1}$ is always the NEGATIVE of the second partial derivative. – user1061210 May 01 '12 at 15:35
  • What general procedure are you referring to? In my answer below, I only took the derivative of the log likelihood with respect to ${\boldsymbol \mu}$ and ${\boldsymbol \Sigma}$. Also, the fisher information is defined for other distributions - it is defined as the expected gradient outer product (which happens to equal the negative expected hessian) of the log-likelihood - I'm not sure what you meant by "Fisher info is not generally true most other densities, right?" – Macro May 01 '12 at 15:41
  • Sorry, I was vague. I am learning the Laplace approximation to a posterior mode. The idea behind that is to do a Taylor expansion at the mode. In a Taylor expansion, all terms and derivatives are positive, thus I don't get why the NEGATIVE sign. – user1061210 May 01 '12 at 15:46
  • @User In the Taylor expansion around a mode (a local *maximum*), the first derivatives had better all be zero and the second derivatives *negative* (or zero), for otherwise the function is convex *upward* and you're at a local minimum! (A sketch of this expansion is given after these comments.) – whuber May 01 '12 at 16:03
  • @whuber Thanks, got it! so, it IS the case that first derivative=0, second derivative<=0. – user1061210 May 01 '12 at 16:29
  • @whuber: in this case ${{\Sigma }^{-1}}=-\frac{{{\partial }^{2}}}{\partial {{\theta }^{2}}}\log p(\hat{\theta }|y)$, why are we slapping a negative sign in front of it? Is the variance always obtained from the negative of the second partial derivative? I see this happening in the Beta case also. – user1061210 May 01 '12 at 17:22
  • @user As I pointed out, the second derivative of the logarithm *must* have non-positive eigenvalues. Yes, there are links between variances and negative second partial derivatives, as the theory of maximum likelihood estimation, Fisher information, etc., reveals--Macro has referred to that earlier in these comments. – whuber May 01 '12 at 19:19
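To spell out the expansion referred to in the comments: a second-order Taylor expansion of the log posterior about the mode $\hat\theta$, where the gradient is zero, gives
$$\log p(\theta|y)\approx\log p(\hat{\theta}|y)+\frac{1}{2}(\theta-\hat{\theta})^{T}\left[\frac{\partial^{2}}{\partial\theta^{2}}\log p(\hat{\theta}|y)\right](\theta-\hat{\theta}),$$
and matching this with the $-\frac{1}{2}(\theta-\hat{\theta})^{T}\Sigma^{-1}(\theta-\hat{\theta})$ term of the Laplace approximation yields $\Sigma^{-1}=-\frac{\partial^{2}}{\partial\theta^{2}}\log p(\hat{\theta}|y)$; the minus sign is exactly what makes $\Sigma^{-1}$ positive semi-definite at a maximum.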

3 Answers


Chapter 2 of the Matrix Cookbook has a nice review of matrix calculus, with many identities that are useful for problems one encounters in probability and statistics, including rules for differentiating the multivariate Gaussian likelihood.

If you have a random vector ${\boldsymbol y}$ that is multivariate normal with mean vector ${\boldsymbol \mu}$ and covariance matrix ${\boldsymbol \Sigma}$, then use equation (86) in the matrix cookbook to find that the gradient of the log likelihood ${\bf L}$ with respect to ${\boldsymbol \mu}$ is

$$\begin{align} \frac{ \partial {\bf L} }{ \partial {\boldsymbol \mu}} &= -\frac{1}{2} \left( \frac{\partial \left( {\boldsymbol y} - {\boldsymbol \mu} \right)' {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu}\right) }{\partial {\boldsymbol \mu}} \right) \nonumber \\ &= -\frac{1}{2} \left( -2 {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu}\right) \right) \nonumber \\ &= {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu} \right) \end{align}$$

I'll leave it to you to differentiate this again and find the answer to be $-{\boldsymbol \Sigma}^{-1}$.

As "extra credit", use equations (57) and (61) to find that the gradient with respect to ${\boldsymbol \Sigma}$ is

$$ \begin{align} \frac{ \partial {\bf L} }{ \partial {\boldsymbol \Sigma}} &= -\frac{1}{2} \left( \frac{ \partial \log(|{\boldsymbol \Sigma}|)}{\partial{\boldsymbol \Sigma}} + \frac{\partial \left( {\boldsymbol y} - {\boldsymbol \mu}\right)' {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y}- {\boldsymbol \mu}\right) }{\partial {\boldsymbol \Sigma}} \right)\\ &= -\frac{1}{2} \left( {\boldsymbol \Sigma}^{-1} - {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu} \right) \left( {\boldsymbol y} - {\boldsymbol \mu} \right)' {\boldsymbol \Sigma}^{-1} \right) \end{align} $$

I've left out a lot of the steps, but I made this derivation using only the identities found in the matrix cookbook, so I'll leave it to you to fill in the gaps.

I've used these score equations for maximum likelihood estimation, so I know they are correct :)
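If you want to check the gradient w.r.t. ${\boldsymbol \mu}$ numerically, here is a minimal finite-difference sketch (this assumes the mvtnorm package; the parameter values are arbitrary):

library(mvtnorm)

set.seed(1)
p <- 3
mu <- rnorm(p)
Sigma <- rWishart(1, p + 1, diag(p))[, , 1]  # a random covariance matrix
y <- rmvnorm(1, mu, Sigma)[1, ]

# Analytic gradient of the log likelihood w.r.t. mu: Sigma^{-1} (y - mu)
analytic <- solve(Sigma) %*% (y - mu)

# Central finite differences of the log density in each coordinate of mu
h <- 1e-6
fd <- sapply(1:p, function(i) {
  e <- rep(0, p); e[i] <- h
  (dmvnorm(y, mu + e, Sigma, log = TRUE) -
     dmvnorm(y, mu - e, Sigma, log = TRUE)) / (2 * h)
})

cbind(analytic, fd)  # the two columns should agree closely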

Macro
  • Great reference - was going to recommend it myself. Not a good pedagogical reference for someone who doesn't know matrix algebra though. The real challenge comes from actually working out $\Sigma$. A real pain. – probabilityislogic May 01 '12 at 11:11
  • Another good source on matrix calculus is Magnus & Neudecker, http://www.amazon.com/Differential-Calculus-Applications-Statistics-Econometrics/dp/047198633X – StasK May 01 '12 at 15:32
  • The equation's reference number has been changed (maybe due to a new edition). The new reference equation is 86. – goelakash May 17 '16 at 09:22
  • I could be off-base here but I don't think this formula is correct. I've been using this with real examples and looking at their finite differences. It seems that the formula for $\frac{ \partial {\bf L} }{ \partial {\boldsymbol \Sigma}}$ gives the correct values for the diagonal entries. However, the off-diagonal entries are half of what they should be. – jjet Apr 09 '18 at 19:27
  • Using equation (57), there should be a transpose mark too; where is the transpose mark? – Ernie Sender Dec 08 '20 at 23:30
  • According to (61) there should be a transpose mark on the $\Sigma$ too; where is the transpose mark? – Ernie Sender Dec 08 '20 at 23:49
  • OK, answering myself: it is because the covariance matrix is symmetric. – Ernie Sender Dec 09 '20 at 00:26

Expression for the log of the normal density

We consider the log of the normal density \begin{align} \log p(y|\mu,\Sigma)=-\frac{D}{2}\log{|2\pi|}-\frac{1}{2}\log{|\Sigma|}-\frac{1}{2}(y-\mu)^\top\Sigma^{-1}(y-\mu)\quad\quad(1) \end{align} where $D$ denotes the dimension of $y$ and $\mu$.

Derivative w.r.t. mean

We have \begin{align} \frac{\partial\log p(y|\mu,\Sigma)}{\partial\mu}=\Sigma^{-1}(y-\mu) \end{align} from (96, 97) of the Matrix Cookbook, noting that the first two terms on the r.h.s. of (1) differentiate to 0.

Derivative w.r.t. covariance

This requires careful consideration of the fact that $\Sigma$ is symmetric; see the example at the bottom for the importance of taking this into account!

By (141) of the Matrix Cookbook, for a symmetric $\Sigma$ we have the derivative

\begin{align} \frac{\partial \log|\Sigma|}{\partial \Sigma}&=2\Sigma^{-1}-(\Sigma^{-1}\circ I) \end{align}

and (139) of the Matrix Cookbook gives \begin{align} \frac{\partial \textrm{trace}(\Sigma^{-1}xx^\top)}{\partial \Sigma}&=-2\Sigma^{-1}xx^\top\Sigma^{-1}+(\Sigma^{-1}xx^\top\Sigma^{-1}\circ I) \end{align}

where $\circ$ denotes the Hadamard product and for convenience we have defined $x:=y-\mu$. Note that both expressions would be different if $\Sigma$ were not required to be symmetric. Putting these together we have

\begin{align} \frac{\partial\log p(y|\mu,\Sigma)}{\partial\Sigma}&=-\frac{\partial }{\partial \Sigma}\frac{1}{2}\left(D\log|2\pi|+ \log|\Sigma| + x^{\top}\Sigma^{-1}x\right)\\ &=-\frac{\partial }{\partial \Sigma}\frac{1}{2}\left( \log|\Sigma| + \textrm{trace}(\Sigma^{-1}xx^\top)\right)\\ &=-\frac{1}{2}\left( 2\Sigma^{-1}-(\Sigma^{-1}\circ I) -2\Sigma^{-1}xx^\top\Sigma^{-1}+(\Sigma^{-1}xx^\top\Sigma^{-1}\circ I)\right) \end{align}

as the derivative of $\frac{D}{2}\log|2\pi|$ is 0.

Note that it is WRONG to ignore that $\Sigma$ is symmetric


Impact of $\Sigma$ being symmetric

This example shows why you can't just ignore the fact that $\Sigma$ is symmetric when differentiating with respect to its elements. Consider the matrix function \begin{align} f(X)=\sum_{ij} X_{ij}, \end{align} which just sums up all the elements of an arbitrary matrix $X$. If we consider

\begin{align*} \\ &1)\quad X =\left(\begin{array}{cc} a & b\\ c & d \end{array}\right) & \implies && \frac{df}{dX} & =\left(\begin{array}{cc} 1 & 1\\ 1 & 1 \end{array}\right)\\ &2)\quad X^{*}=\left(\begin{array}{cc} a & b\\ b & c \end{array}\right) & \implies && \frac{df}{dX^{*}} & =\left(\begin{array}{cc} 1 & 2\\ 2 & 1 \end{array}\right) \end{align*}

then we see that the derivatives of $f$ w.r.t. the elements of $X$ depend on whether or not $X$ is constrained to be symmetric.
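As a numerical sanity check of the symmetric-$\Sigma$ gradient derived above, here is a small finite-difference sketch (assuming the mvtnorm package; $\Sigma_{ij}$ and $\Sigma_{ji}$ are perturbed together, which is the appropriate notion of changing an off-diagonal element when $\Sigma$ is symmetric):

library(mvtnorm)

set.seed(2)
p <- 3
mu <- rnorm(p)
Sigma <- rWishart(1, p + 1, diag(p))[, , 1]
y <- rmvnorm(1, mu, Sigma)[1, ]
x <- y - mu

Sinv <- solve(Sigma)
outer.term <- Sinv %*% x %*% t(x) %*% Sinv

# -1/2 * (2 Sinv - (Sinv o I) - 2 outer.term + (outer.term o I)),
# where "o I" keeps only the diagonal (Hadamard product with the identity)
grad.sym <- -0.5 * (2 * Sinv - diag(diag(Sinv)) - 2 * outer.term + diag(diag(outer.term)))

# Finite differences of the log density, shifting Sigma_ij and Sigma_ji together
h <- 1e-6
fd <- matrix(NA, p, p)
for (i in 1:p) {
  for (j in 1:p) {
    E <- matrix(0, p, p)
    E[i, j] <- 1
    E[j, i] <- 1
    fd[i, j] <- (dmvnorm(y, mu, Sigma + h * E, log = TRUE) -
                   dmvnorm(y, mu, Sigma, log = TRUE)) / h
  }
}

max(abs(grad.sym - fd))  # should be small (of the order of h)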


I tried to computationally verify @Macro's answer but found what appears to be a minor error in the covariance solution. He obtained $$ \begin{align} \frac{ \partial {\bf L} }{ \partial {\boldsymbol \Sigma}} &= -\frac{1}{2} \left( {\boldsymbol \Sigma}^{-1} - {\boldsymbol \Sigma}^{-1} \left( {\boldsymbol y} - {\boldsymbol \mu} \right) \left( {\boldsymbol y} - {\boldsymbol \mu} \right)' {\boldsymbol \Sigma}^{-1} \right) ={\bf A} \end{align} $$ However, it appears that the correct solution is actually $$ {\bf B}=2{\bf A} - \text{diag}({\bf A}) $$ The following R script provides a simple example in which the finite difference is calculated for each element of ${\boldsymbol \Sigma}$. It demonstrates that ${\bf A}$ provides the correct answer only for diagonal elements while ${\bf B}$ is correct for every entry.

library(mvtnorm)

set.seed(1)

# Generate some parameters
p <- 4
mu <- rnorm(p)
Sigma <- rWishart(1, p, diag(p))[, , 1]

# Generate an observation from the distribution as a reference point
x <- rmvnorm(1, mu, Sigma)[1, ]

# Calculate the density at x
f <- dmvnorm(x, mu, Sigma)

# Choose a sufficiently small step-size
h <- .00001

# Calculate the density at x at each shifted Sigma_ij
f.shift <- matrix(NA, p, p)
for(i in 1:p) {
  for(j in 1:p) {
    zero.one.mat <- matrix(0, p, p)
    zero.one.mat[i, j] <- 1
    zero.one.mat[j, i] <- 1

    Sigma.shift <- Sigma + h * zero.one.mat
    f.shift[i, j] <- dmvnorm(x, mu, Sigma.shift)
  }
}

# Calculate the finite difference at each shifted Sigma_ij
fin.diff <- (f.shift - f) / h

# Calculate the solution proposed by @Macro and the true solution
A <- -1/2 * (solve(Sigma) - solve(Sigma) %*% (x - mu) %*% t(x - mu) %*% solve(Sigma))
B <- 2 * A - diag(diag(A))

# Verify that the true solution is approximately equal to the finite difference
fin.diff
A * f
B * f
jjet
  • Thank you for your comment. I believe you interpret the notation differently than everyone else has, because you *simultaneously* change pairs of matching off-diagonal elements of $\Sigma$, thereby doubling the effect of the change. In effect you are computing a *multiple* of a directional derivative. There does appear to be a small problem with Macro's solution insofar as a *transpose* ought to be taken--but that would change nothing in the application to symmetric matrices. – whuber Apr 10 '18 at 21:08
  • @whuber, I believe jjet is actually right, and this answer is consistent with Lawrence Middleton's answer posted above. – husB Nov 26 '20 at 15:44
  • From eq. 138 of [the matrix cookbook](http://www.math.uwaterloo.ca/~hwolkowi//matrixcookbook.pdf), the correct solution should be $\mathbf{A}+\mathbf{A}^T-\text{diag}(\mathbf{A})$, which simplifies to $2\mathbf{A}-\text{diag}(\mathbf{A})$. – husB Nov 26 '20 at 15:54
  • Yeah, he just didn’t interpret the problem correctly. – jjet Nov 27 '20 at 18:05
  • But I thank him for his comment. – jjet Nov 27 '20 at 18:09
  • @jjet you're totally right in terms of what you've tried to show. I've written a clearer explanation as to why you need to be careful when Sigma is symmetric. – Lawrence Middleton Jan 26 '22 at 23:29