
I'm having trouble deriving the KL divergence formula for two multivariate normal distributions. I've done the univariate case fairly easily, but it's been quite a while since I took math stats, so I'm having some trouble extending it to the multivariate case. I'm sure I'm just missing something simple.

Here's what I have...

Suppose both $p$ and $q$ are the pdfs of normal distributions with means $\mu_1$ and $\mu_2$ and covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively. The Kullback-Leibler divergence from $q$ to $p$ is:

$\int \left[\log( p(x)) - \log( q(x)) \right]\ p(x)\ dx$, which for two multivariate normals is:

$\frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + Tr(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right]$

Following the same logic as this proof, I get to about here before I get stuck:

$=\int \left[ \frac{d}{2} \log\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2} \left((x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^T\Sigma_2^{-1}(x-\mu_1) \right) \right] \times p(x) dx$

$=\mathbb{E} \left[ \frac{d}{2} \log\frac{|\Sigma_2|}{|\Sigma_1|} + \frac{1}{2} \left((x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) - (x-\mu_1)^T\Sigma_2^{-1}(x-\mu_1) \right) \right]$

I think I have to use the trace trick (rewriting the quadratic form as $x^T A x = \text{tr}\{A\, x x^T\}$), but I'm just not sure what to do after that. Any helpful hints to put me back on the right track would be appreciated!


1 Answer


Starting from where you began, with some slight corrections, we can write

$$ \begin{aligned} KL &= \int \left[ \frac{1}{2} \log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} (x-\mu_1)^T\Sigma_1^{-1}(x-\mu_1) + \frac{1}{2} (x-\mu_2)^T\Sigma_2^{-1}(x-\mu_2) \right] \times p(x) dx \\ &= \frac{1}{2} \log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \text{tr}\ \left\{E[(x - \mu_1)(x - \mu_1)^T] \ \Sigma_1^{-1} \right\} + \frac{1}{2} E[(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)] \\ &= \frac{1}{2} \log\frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2} \text{tr}\ \{I_d \} + \frac{1}{2} (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) + \frac{1}{2} \text{tr} \{ \Sigma_2^{-1} \Sigma_1 \} \\ &= \frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} - d + \text{tr} \{ \Sigma_2^{-1}\Sigma_1 \} + (\mu_2 - \mu_1)^T \Sigma_2^{-1}(\mu_2 - \mu_1)\right]. \end{aligned} $$

Note that I have used a couple of properties from Section 8.2 of the Matrix Cookbook; they are spelled out below.
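
For reference (eqs. 377 and 380 in the edition cited in the comments below), for $x \sim \mathcal{N}(\mu_1, \Sigma_1)$, the two properties are the definition of the covariance,

$$E[(x - \mu_1)(x - \mu_1)^T] = \Sigma_1,$$

and the trace trick for the expectation of a quadratic form,

$$ \begin{aligned} E[(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2)] &= E[\text{tr}\{\Sigma_2^{-1} (x-\mu_2)(x-\mu_2)^T\}] \\ &= \text{tr}\{\Sigma_2^{-1}\, E[(x-\mu_2)(x-\mu_2)^T]\} \\ &= \text{tr}\{\Sigma_2^{-1} (\Sigma_1 + (\mu_1-\mu_2)(\mu_1-\mu_2)^T)\} \\ &= \text{tr}\{\Sigma_2^{-1} \Sigma_1\} + (\mu_1-\mu_2)^T \Sigma_2^{-1} (\mu_1-\mu_2). \end{aligned} $$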

  • I see you took out the D that I had originally. Wouldn't you have a D term after taking the log of the Gaussian in the first few steps? – dmartin Jun 03 '13 at 15:19
  • Consider the scaling factor $(2\pi)^{-d/2} |\Sigma_k|^{-1/2}$, $k = 1,2$ of the multivariate normal density. When computing the log-difference, the $(2\pi)^{-d/2}$ term goes away. There is no $d$ term for the determinants -- simply a $1/2$, which is factored out. – ramhiser Jun 03 '13 at 15:33
  • Hi, how did you come up with the last step? How did you change the sign of $\mu_1 - \mu_2$ into $\mu_2 - \mu_1$? – acidghost Apr 11 '16 at 06:10
  • @acidghost Either one works because we can factor out a negative one from both sides. Multiplying the two negative ones yields a positive one. – ramhiser Apr 12 '16 at 00:06
  • @JohnA.Ramey Which property of Section 8.2 did you use? Is it eq. 380? If so, I'm not able to follow. Can you explain? – CKM Aug 23 '17 at 06:14
  • @chandresh Use eq. 377 to simplify the second expression on line 2. Use eq. 380 to simplify the last expression on line 2. – ramhiser Aug 23 '17 at 21:07
  • I would still recommend https://stanford.edu/~jduchi/projects/general_notes.pdf page 13. Much easier to follow. – jubueche Jan 18 '21 at 17:03
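
For anyone who wants to verify the closed form numerically, here is a minimal sketch (assuming NumPy and SciPy are available; the Monte Carlo average of $\log p(x) - \log q(x)$ over samples from $p$ should approach the closed-form value):

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_mvn(mu1, S1, mu2, S2):
    """Closed-form KL( N(mu1, S1) || N(mu2, S2) ) from the answer above."""
    d = mu1.shape[0]
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  - d
                  + np.trace(S2_inv @ S1)
                  + diff @ S2_inv @ diff)

rng = np.random.default_rng(0)
d = 3
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
A, B = rng.normal(size=(d, d)), rng.normal(size=(d, d))
S1, S2 = A @ A.T + d * np.eye(d), B @ B.T + d * np.eye(d)  # symmetric positive definite

# Monte Carlo estimate of E_p[log p(x) - log q(x)]
p, q = multivariate_normal(mu1, S1), multivariate_normal(mu2, S2)
x = p.rvs(size=200_000, random_state=1)
mc = np.mean(p.logpdf(x) - q.logpdf(x))

print(kl_mvn(mu1, S1, mu2, S2), mc)  # the two values should agree closely
```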