
Context

The Multivariate Gaussian appears frequently in Machine Learning, and the following results are used in many ML books and courses without derivation.

Given data in the form of a matrix $\mathbf{X} $ of dimensions $ m \times p$, if we assume that the data follows a $p$-variate Gaussian distribution with parameters mean $\mu$ ( $p \times 1 $) and covariance matrix $\Sigma$ ($p \times p$), the Maximum Likelihood Estimators are given by (a quick numerical check is sketched after the list):

  • $\hat \mu = \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}}$
  • $\hat \Sigma = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T $
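For readers who like to sanity-check formulas numerically, here is a minimal Python/numpy sketch (not part of the original question; the simulated data, seed, and variable names are purely illustrative). It verifies that the closed-form MLEs coincide with numpy's sample mean and biased sample covariance.

```python
# Minimal numerical sanity check of the MLE formulas above.
# All names and the simulated data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
m, p = 1000, 3
mean_true = np.array([1.0, -2.0, 0.5])
cov_true = np.array([[2.0, 0.3, 0.0],
                     [0.3, 1.0, 0.2],
                     [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(mean_true, cov_true, size=m)   # m x p data matrix

mu_hat = X.mean(axis=0)                          # (1/m) sum_i x^(i)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / m    # (1/m) sum_i (x^(i) - mu_hat)(x^(i) - mu_hat)^T

# np.cov with bias=True uses the 1/m normalisation, i.e. the MLE.
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
```

Note that $\hat \Sigma$ uses the $1/m$ normalisation (the MLE), not the $1/(m-1)$ unbiased estimator.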

I understand that knowledge of the multivariate Gaussian is a prerequisite for many ML courses, but it would be helpful to have the full derivation in a self-contained answer once and for all, as I feel many self-learners are bouncing around the stats.stackexchange and math.stackexchange websites looking for answers.


Question

What is the full derivation of the Maximum Likelihood Estimators for the multivariate Gaussian?


Examples:

These lecture notes (page 11) on Linear Discriminant Analysis, as well as these ones, make use of the results and assume previous knowledge.

There are also a few posts which are partly answered or closed:

Xavier Bourret Sicotte



Deriving the Maximum Likelihood Estimators

Assume that we have $m$ random vectors, each of size $p$: $\mathbf{X^{(1)}, X^{(2)},...,X^{(m)}}$, where each random vector can be interpreted as an observation (data point) across $p$ variables. If the $\mathbf{X}^{(i)}$ are i.i.d. multivariate Gaussian vectors:

$$ \mathbf{X^{(i)}} \sim \mathcal{N}_p(\mu, \Sigma) $$

where the parameters $\mu, \Sigma$ are unknown. To obtain their estimates we can use the method of maximum likelihood and maximize the log-likelihood function.

Note that by the independence of the random vectors, the joint density of the data $\{\mathbf{X^{(i)}}, i = 1,2,...,m\}$ is the product of the individual densities, that is $\prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} ; \mu , \Sigma })$. Taking the logarithm gives the log-likelihood function:

\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \log \prod_{i=1}^m f_{\mathbf{X^{(i)}}}(\mathbf{x^{(i)} | \mu , \Sigma }) \\ & = \log \ \prod_{i=1}^m \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} \exp \left( - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \\ & = \sum_{i=1}^m \left( - \frac{p}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \right) \end{aligned}

\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = - \frac{mp}{2} \log (2 \pi) - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \end{aligned}
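As a quick numerical cross-check of this expression, the sketch below (my own, with illustrative names; not part of the original derivation) implements the log-likelihood and compares it against `scipy.stats.multivariate_normal`.

```python
# Hedged sketch: the log-likelihood above, checked against scipy's density.
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """l(mu, Sigma | x^(1), ..., x^(m)) for an m x p data matrix X."""
    m, p = X.shape
    diff = X - mu                                   # row i is (x^(i) - mu)^T
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)   # (x^(i)-mu)^T Sigma^{-1} (x^(i)-mu)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (m * p * np.log(2 * np.pi) + m * logdet + quad.sum())

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
mu, Sigma = np.zeros(3), np.eye(3)

# scipy returns one log-density per row; summing them gives the same value.
assert np.isclose(log_likelihood(X, mu, Sigma),
                  multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum())
```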

Deriving $\hat \mu$

To take the derivative with respect to $\mu$ and equate it to zero, we will make use of the following matrix calculus identity:

$\mathbf{ \frac{\partial w^T A w}{\partial w} = 2Aw}$ if $\mathbf{A}$ does not depend on $\mathbf{w}$ and $\mathbf{A}$ is symmetric.

\begin{aligned} \frac{\partial }{\partial \mu} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \sum_{i=1}^m \mathbf{ \Sigma^{-1} ( x^{(i)} - \mu ) } = 0 \\ & \text{Since $\Sigma^{-1}$ is positive definite (hence invertible), we can multiply through by $\Sigma$:} \\ 0 & = m \mu - \sum_{i=1}^m \mathbf{ x^{(i)} } \\ \hat \mu &= \frac{1}{m} \sum_{i=1}^m \mathbf{ x^{(i)} } = \mathbf{\bar{x}} \end{aligned}

This estimator is often called the sample mean vector.
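If it helps, the derivative formula itself can be verified by finite differences; the sketch below (illustrative, not from the original answer) checks that $\sum_{i=1}^m \Sigma^{-1}(x^{(i)} - \mu)$ matches the numerical gradient of the log-likelihood with respect to $\mu$, and in particular vanishes at $\mathbf{\bar{x}}$.

```python
# Finite-difference check of d l / d mu = sum_i Sigma^{-1} (x^(i) - mu).
# Names and data are illustrative.
import numpy as np

def log_likelihood(X, mu, Sigma):
    m, p = X.shape
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (m * p * np.log(2 * np.pi) + m * logdet + quad.sum())

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
Sigma = np.cov(X, rowvar=False, bias=True)
mu = np.array([0.1, -0.2, 0.3])                         # an arbitrary test point

analytic = np.linalg.inv(Sigma) @ (X - mu).sum(axis=0)  # sum_i Sigma^{-1}(x^(i) - mu)

eps = 1e-6
numeric = np.array([(log_likelihood(X, mu + eps * e, Sigma)
                     - log_likelihood(X, mu - eps * e, Sigma)) / (2 * eps)
                    for e in np.eye(3)])

assert np.allclose(analytic, numeric, atol=1e-4)
# The gradient is exactly zero at mu = X.mean(axis=0), i.e. at the sample mean.
```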

Deriving $\hat \Sigma$

Deriving the MLE for the covariance matrix requires more work and the use of the following linear algebra and calculus properties:

  • The trace is invariant under cyclic permutations of matrix products: $\mathrm{tr}\left[ABC\right] = \mathrm{tr}\left[CAB\right] = \mathrm{tr}\left[BCA\right]$
  • Since $x^TAx$ is a scalar, it equals its own trace, and by the cyclic property: $x^TAx = \mathrm{tr}\left[x^TAx\right] = \mathrm{tr}\left[xx^TA\right]$
  • $\frac{\partial}{\partial A} \mathrm{tr}\left[AB\right] = B^T$
  • $\frac{\partial}{\partial A} \log |A| = A^{-T}$
  • The determinant of the inverse of an invertible matrix is the inverse of the determinant: $|A| = \frac{1}{|A^{-1}|}$

Combining these properties allows us to calculate

$$ \frac{\partial}{\partial A} x^TAx =\frac{\partial}{\partial A} \mathrm{tr}\left[xx^TA\right] = [xx^T]^T = \left(x^{T}\right)^Tx^T = xx^T $$

This is the outer product of the vector $x$ with itself.
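A small finite-difference sketch (my own addition, not part of the answer) can be used to confirm both of these matrix derivatives numerically for a random symmetric positive definite $A$:

```python
# Numerical check of d/dA (x^T A x) = x x^T and d/dA log|A| = A^{-T}.
# Purely illustrative; treats every entry of A as a free parameter.
import numpy as np

rng = np.random.default_rng(3)
p = 3
x = rng.normal(size=p)
C = rng.normal(size=(p, p))
A = C @ C.T + p * np.eye(p)            # a well-conditioned symmetric PD matrix

def num_grad(f, A, eps=1e-6):
    """Entry-wise central finite differences of a scalar function of a matrix."""
    G = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            G[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return G

assert np.allclose(num_grad(lambda M: x @ M @ x, A), np.outer(x, x), atol=1e-5)
assert np.allclose(num_grad(lambda M: np.linalg.slogdet(M)[1], A),
                   np.linalg.inv(A).T, atol=1e-5)
```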

We can now re-write the log-likelihood function and compute the derivative with respect to $\Sigma^{-1}$ (note that $C$ is a constant that does not depend on $\Sigma$):

\begin{aligned} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \text{C} - \frac{m}{2} \log |\Sigma| - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu)^T \Sigma^{-1} (x^{(i)} - \mu) } \\ & = \text{C} + \frac{m}{2} \log |\Sigma^{-1}| - \frac{1}{2} \sum_{i=1}^m \mathrm{tr}\left[ \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)^T \Sigma^{-1} } \right] \\ \frac{\partial }{\partial \Sigma^{-1}} l(\mathbf{ \mu, \Sigma | x^{(i)} }) & = \frac{m}{2} \Sigma - \frac{1}{2} \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \ \ \text{Since $\Sigma^T = \Sigma$} \end{aligned}

Equating to zero and solving for $\Sigma$ (with $\mu$ replaced by its MLE $\hat \mu$):

\begin{aligned} 0 &= m \Sigma - \sum_{i=1}^m \mathbf{(x^{(i)} - \mu) (x^{(i)} - \mu)}^T \\ \hat \Sigma & = \frac{1}{m} \sum_{i=1}^m \mathbf{(x^{(i)} - \hat \mu) (x^{(i)} -\hat \mu)}^T \end{aligned}
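As a final sanity check (and touching on the concavity point raised in the comments), the sketch below (illustrative, not from the original answer) verifies numerically that the closed-form $(\hat \mu, \hat \Sigma)$ attains a higher log-likelihood than nearby symmetric perturbations of $\hat \Sigma$.

```python
# Illustrative check that (mu_hat, Sigma_hat) is a (local) maximiser.
import numpy as np

def log_likelihood(X, mu, Sigma):
    m, p = X.shape
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (m * p * np.log(2 * np.pi) + m * logdet + quad.sum())

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) @ np.diag([1.0, 2.0, 1.5]) + np.array([1.0, 0.0, -1.0])

mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)

best = log_likelihood(X, mu_hat, Sigma_hat)
for _ in range(20):
    B = rng.normal(scale=0.02, size=(3, 3))
    perturbed = Sigma_hat + B + B.T        # small symmetric perturbation, still PD
    assert log_likelihood(X, mu_hat, perturbed) <= best
```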


Xavier Bourret Sicotte
  • Alternative proofs, more compact forms, or intuitive interpretation are welcome ! – Xavier Bourret Sicotte Jun 15 '18 at 13:16
  • In the derivation for $\mu$, why does $\Sigma$ need to be positive definite? Does it seem enough that $\Sigma$ is invertible? For an invertible matrix $A$, $Ax=0$ only when $x=0$? – Tom Bennett Apr 14 '19 at 03:01
  • To clarify, $\Sigma$ is an $m \times m$ matrix that may have finite diagonal and non-diagonal components indicating correlation between vectors, correct? If that is the case, in what sense are these vectors independent? Also, why is the joint probability function equal to the likelihood? Shouldn't the joint density, $f(x,y)$, be equal to the likelihood multiplied by the prior, i.e. $f(x|y)f(y)$? – Mathews24 Apr 25 '19 at 02:07
  • @TomBennett the sigma matrix is positive definite by definition - see https://stats.stackexchange.com/questions/52976/is-a-sample-covariance-matrix-always-symmetric-and-positive-definite for the proof. The matrix calculus identity requires the matrix to be symmetric, not positive definite. But since positive definite matrices are always symmetric, that works – Xavier Bourret Sicotte Apr 26 '19 at 00:24
  • @Mathews24 the sigma matrix is the covariance matrix, which is positive definite by definition. The X matrix is independent. Independence allows us to calculate the joint density as a product of individual densities. You can google the proof as here: https://www.probabilitycourse.com/chapter5/5_2_3_conditioning_independence.php – Xavier Bourret Sicotte Apr 26 '19 at 00:27
  • @XavierBourretSicotte Yes, I agree the covariance matrix, $\Sigma$, is positive definite by definition. But what does this covariance matrix intuitively mean? Is $\Sigma$ not the similarity or correlation between each $X^{(i)}$? There is similarly a "proof" [here](https://www.statlect.com/fundamentals-of-statistics/normal-distribution-maximum-likelihood) as well; I am simply wondering why likelihood is defined as $L = \Pi f_X$ which is exactly the same as the joint density function. But shouldn't the joint density function be equal to $L \times P$, where $P$ is the prior? Where's the prior? – Mathews24 Apr 26 '19 at 01:42
  • Uhh, I think you need to revise your fundamentals - or ask a new question. The purpose here is to show the derivation of the multivariate Gaussian - understanding of the likelihood is a prerequisite - good luck! – Xavier Bourret Sicotte Apr 26 '19 at 01:53
  • @XavierBourretSicotte, thanks for responding. I am a bit confused. We are talking about going from $\sum_{i=1}^m \Sigma^{-1} (\mu - x^{(i)})=0$ to $\Sigma^{-1} \sum_{i=1}^m (\mu - x^{(i)})=0$ and then to $\sum_{i=1}^m (\mu - x^{(i)})=0$. Why is the requirement to be symmetric instead of invertible? If it is not invertible, then can't we just find something that sends to the null space of $\Sigma^{-1}$? Sorry it may totally be a stupid question. (totally get that $\Sigma$ is psd.) – Tom Bennett Apr 26 '19 at 03:00
  • @TomBennett honestly I don't know ! Look it up and feel free to come back with the answer. I am a student in this field not an expert ! My assumption that sigma needs to be symmetric might be wrong – Xavier Bourret Sicotte Apr 26 '19 at 04:18
  • @XavierBourretSicotte The question is asking for the derivation of the maximum likelihood of a multivariate Gaussian. I was asking about one of the steps used. I am simply saying the likelihood is defined as $L = \Pi f_X / P$ while you wrote $L = \Pi f_X$. [I am simply wondering why you did not use a prior, $P$ on the random variable which is part of the definition of the joint density function, $f_X$](https://en.wikipedia.org/wiki/Bayes%27_theorem#For_random_variables). – Mathews24 Apr 28 '19 at 17:09
  • @Mathews24 the prior is irrelevant in this context, I think you are confused - read up on likelihood functions here and elsewhere https://en.wikipedia.org/wiki/Likelihood_function – Xavier Bourret Sicotte Apr 30 '19 at 16:06
  • @XavierBourretSicotte I agree independence allows one to calculate the joint density as a product of individual densities. But you mention that "the X matrix is independent", which is misleading. We know the vectors composing $X$ are necessarily dependent upon each other as prescribed by the covariance function (i.e. if non-zero, they are correlated). But there is a distinction between the observations of variables being independent (which is why the likelihood can be factorized) versus the variables being independent. – Mathews24 May 04 '19 at 17:41
  • Yes indeed - independence between observations allows us to get the likelihood - the wording may be unclear, fair enough - this is the multivariate version of the likelihood. The prior is still irrelevant regardless – Xavier Bourret Sicotte May 04 '19 at 18:16
  • The cyclic permutations are not cyclic, it should say tr[ABC] instead of tr[ACB]. – Kuku Nov 30 '20 at 19:00
  • @Kuku yes thanks I have edited and fixed the mistake – Xavier Bourret Sicotte Dec 07 '20 at 11:44
  • That was very useful, thank you very much. I think there might be a point that needs to be more explicit in the demonstration: the fact that the overall likelihood is concave with respect to the parameter space. Given the elements you provide, I think you just need to prove that the Hessian is negative definite everywhere on the parameter space. – Tobbey Mar 14 '21 at 21:23
  • Thanks ! Feel free to post an answer below with the additional proof ! – Xavier Bourret Sicotte Mar 15 '21 at 10:17
  • Nice answer! I think there is only a minor mistake in the identity $x^TAx = \mathrm{tr}\left[x^TAx\right] = \mathrm{tr}\left[x^TxA\right]$. I think that last term should be $\mathrm{tr}\left[xx^TA\right]$ – J.Galt Feb 26 '22 at 16:08

An alternate proof for $\widehat{\Sigma}$ that takes the derivative with respect to $\Sigma$ directly:

Picking up with the log-likelihood as above:
\begin{eqnarray} \ell(\mu, \Sigma) &=& C - \frac{m}{2}\log|\Sigma|-\frac{1}{2} \sum_{i=1}^m \text{tr}\left[(\mathbf{x}^{(i)}-\mu)^T \Sigma^{-1} (\mathbf{x}^{(i)}-\mu)\right]\\ &=&C - \frac{1}{2}\left(m\log|\Sigma| + \sum_{i=1}^m\text{tr} \left[(\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T\Sigma^{-1} \right]\right)\\ &=&C - \frac{1}{2}\left(m\log|\Sigma| +\text{tr}\left[ S_\mu \Sigma^{-1} \right] \right) \end{eqnarray}
where $S_\mu = \sum_{i=1}^m (\mathbf{x}^{(i)}-\mu)(\mathbf{x}^{(i)}-\mu)^T$ and we have used the cyclic and linear properties of $\text{tr}$.

To compute $\partial \ell /\partial \Sigma$ we first observe that
$$ \frac{\partial}{\partial \Sigma} \log |\Sigma| = \Sigma^{-T}=\Sigma^{-1} $$
by the fourth property above. To take the derivative of the second term we will need the property that
$$ \frac{\partial}{\partial X}\text{tr}\left( A X^{-1} B\right) = -(X^{-1}BAX^{-1})^T $$
(from The Matrix Cookbook, equation 63). Applying this with $B=I$ we obtain that
$$ \frac{\partial}{\partial \Sigma}\text{tr}\left[S_\mu \Sigma^{-1}\right] = -\left( \Sigma^{-1} S_\mu \Sigma^{-1}\right)^T = -\Sigma^{-1} S_\mu \Sigma^{-1} $$
because both $\Sigma$ and $S_\mu$ are symmetric. Then
$$ \frac{\partial}{\partial \Sigma}\ell(\mu, \Sigma) \propto m \Sigma^{-1} - \Sigma^{-1} S_\mu \Sigma^{-1}. $$
Setting this to 0 and rearranging gives
$$ \widehat{\Sigma} = \frac{1}{m}S_\mu. $$
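The Matrix Cookbook identity used above can also be checked numerically; the following sketch (my own addition, with illustrative names) compares it against entry-wise finite differences.

```python
# Finite-difference check of d/dX tr(A X^{-1} B) = -(X^{-1} B A X^{-1})^T.
import numpy as np

rng = np.random.default_rng(5)
p = 3
A = rng.normal(size=(p, p))
B = rng.normal(size=(p, p))
C = rng.normal(size=(p, p))
X = C @ C.T + np.eye(p)                 # a well-conditioned invertible matrix

def f(M):
    return np.trace(A @ np.linalg.inv(M) @ B)

eps = 1e-6
numeric = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

Xinv = np.linalg.inv(X)
assert np.allclose(numeric, -(Xinv @ B @ A @ Xinv).T, atol=1e-5)
```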

This approach is more work than the standard one using derivatives with respect to $\Lambda = \Sigma^{-1}$, and requires a more complicated trace identity. I only found it useful because I currently need to take derivatives of a modified likelihood function for which it seems much harder to use $\partial/{\partial \Sigma^{-1}}$ than $\partial/\partial \Sigma$.

Eric Kightley

While the previous answers are correct, using the trace is, from a personal point of view, unnecessary.

The following derivation might be more succinct:

[The derivation was posted as an image in the original answer and is not reproduced here.]

Wiza
  • Isn't this exactly the same as my answer, but skipping the steps explaining the derivative of the matrix? Apologies, but I fail to see what this is adding – Xavier Bourret Sicotte Oct 29 '21 at 15:44
  • Personally, I find proofs (in my textbooks) with trace messing up my ability to remember/understand/appreciate the proof. It's like proving another theorem (2 in my answer) every time, since 2 in my answer is a standard result in matrix reference books, as I listed. This is just for people who might have the same issue. Anyway, all best intentions! – Wiza Oct 30 '21 at 04:14