
I am facing a situation similar to the one addressed in this question, and the accepted answer there has helped me a lot, but I need to resolve a doubt.

The accepted answer draws on the excellent resource "Matrix Cookbook" to show that

$$\frac{\partial \mathbf{L}}{\partial \mathbf{\Sigma}}= -\frac{1}{2}\left(\mathbf{\Sigma^{-1}-\Sigma^{-1}(y-\mu)(y-\mu)'\Sigma^{-1}}\right)$$

where $\mathbf{L}$ is the log-likelihood of the Gaussian vector $\mathbf{y}$ with covariance matrix $\mathbf{\Sigma}$ and mean $\mathbf{\mu}$.

If I'm not mistaken, to solve for the $\mathbf{\Sigma}$ that maximizes $\mathbf{L}$, one would then set $\frac{\partial \mathbf{L}}{\partial \mathbf{\Sigma}}=0$ and get

$$\mathbf{\Sigma^{-1}=\Sigma^{-1}(y-\mu)(y-\mu)'\Sigma^{-1}}$$

Pre- and post-multiplying by $\mathbf{\Sigma}$ gives

$$\mathbf{\Sigma=(y-\mu)(y-\mu)'}$$

This brings me to my doubt. The RHS of this equation is the outer product of a single vector, which has rank 1 and determinant 0, and is thus not invertible. But the covariance matrix $\mathbf{\Sigma}$ on the LHS must be invertible. Likewise, the intermediate equation

$$\mathbf{I=\Sigma^{-1}(y-\mu)(y-\mu)'}$$

seems to be a contradiction since $\mathbf{[(y-\mu)(y-\mu)']^{-1}}$ does not exist.
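
For concreteness, here is a quick NumPy check of that claim (the vector below is an arbitrary stand-in for $\mathbf{y-\mu}$):

```python
import numpy as np

# Quick numerical check: for a single observation, the outer product
# (y - mu)(y - mu)' has rank 1 and determinant 0 (for k > 1).
rng = np.random.default_rng(0)
k = 3
z = rng.standard_normal(k)           # stands in for y - mu
outer = np.outer(z, z)               # (y - mu)(y - mu)'

print(np.linalg.matrix_rank(outer))  # 1
print(np.linalg.det(outer))          # ~0, up to floating-point noise
```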

Anyway, what this seems to say to me is that the condition for a maximum of $\mathbf{L}$ requires $\mathbf{\Sigma}$ to be rank deficient with determinant 0, in which case it could not really be called a covariance matrix, and $\mathbf{L}$ would be undefined at its maximum.

And yet the answerer says that he/she uses these formulae all the time for ML parameter estimation, so I guess I am missing something. Please help.

ben

1 Answer


The derivative you describe is that of the log-density (and hence the log-likelihood) of a single observation. If you instead suppose that you have $n$ observations drawn independently from the same distribution, the full likelihood is $$ L(\theta|y_1, \dots, y_n)=\prod_{i=1}^n p(y_i|\theta), $$ so that $$ \log L(\theta|y_1, \dots, y_n)=\sum_{i=1}^n \log p(y_i|\theta). $$
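
As a minimal sketch in Python (the parameter values and names here are my own, chosen for illustration), the joint log-likelihood is just the sum of per-observation log-densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: log L(mu, Sigma | y_1, ..., y_n) = sum_i log p(y_i | mu, Sigma)
# for n i.i.d. multivariate-normal observations.
rng = np.random.default_rng(1)
k, n = 3, 50
mu = np.zeros(k)
Sigma = np.eye(k)
Y = rng.multivariate_normal(mu, Sigma, size=n)  # shape (n, k), one row per y_i

log_lik = multivariate_normal.logpdf(Y, mean=mu, cov=Sigma).sum()
print(log_lik)
```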

The derivative in your post is $$ \frac{\partial \log p(y_i|\mu, \Sigma)}{\partial \Sigma} $$ which is summed over $i=1, \dots, n$ yielding $$ \frac{\partial \log L(\mu, \Sigma|y_1, \dots, y_n)}{\partial \Sigma}=\sum_{i=1}^n \frac{\partial \log p(y_i|\mu, \Sigma)}{\partial \Sigma}=-\frac{n}{2}\Sigma^{-1}+\frac{1}{2}\Sigma^{-1}\left[\sum_{i=1}^n(y_i-\mu)(y_i-\mu)'\right]\Sigma^{-1}. $$ Solving the same way you did results in $$ \Sigma=\frac{1}{n}\sum_{i=1}^n(y_i-\mu)(y_i-\mu)'. $$
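
A quick numerical check of this solution (again a sketch; the setup is arbitrary): plugging $\hat\Sigma=\frac{1}{n}\sum_{i=1}^n(y_i-\mu)(y_i-\mu)'$ back into the gradient expression above should give a matrix that is numerically zero.

```python
import numpy as np

# Sketch: compute Sigma_hat = (1/n) sum_i (y_i - mu)(y_i - mu)' with mu
# known, and verify the first-order condition d log L / d Sigma = 0 there.
rng = np.random.default_rng(2)
k, n = 3, 200
mu = np.zeros(k)
Y = rng.multivariate_normal(mu, np.diag([1.0, 2.0, 3.0]), size=n)

Z = Y - mu                  # rows are (y_i - mu)'
Sigma_hat = Z.T @ Z / n     # the ML estimate derived above

S_inv = np.linalg.inv(Sigma_hat)
grad = -0.5 * n * S_inv + 0.5 * S_inv @ (Z.T @ Z) @ S_inv
print(np.abs(grad).max())   # ~0: the gradient vanishes at Sigma_hat
```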

If you now let $Z=(z_1, \dots, z_n)'$ (an $n\times k$ matrix) where $z_i=y_i-\mu$, you may equivalently express this as $$ \Sigma=\frac{1}{n}Z'Z. $$ Note here that the rank of $Z'Z$ is at most $k$, as it is a $k\times k$ matrix. Further, the rank of $Z$ is at most $\min(n, k)$ (i.e. the smaller of the number of rows and columns), and $\operatorname{rank}(Z'Z)=\operatorname{rank}(Z)$. In fact, with the rows of $Z$ being independent draws from a normal distribution, the rank of $Z$ will be exactly $\min(n, k)$ with probability one. Thus, for the example in your original post you have $n=1$, so that $\operatorname{rank}(Z)=1$, hence $\operatorname{rank}(Z'Z)=1$, and the estimate of the covariance matrix is singular. For non-singularity, you will need $n\geq k$. A short sketch illustrating this rank argument (the dimensions are arbitrary):
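
```python
import numpy as np

# Sketch: with rows of Z drawn from a continuous distribution,
# rank(Z) = min(n, k) almost surely, so Z'Z is singular whenever n < k.
rng = np.random.default_rng(3)
k = 5
for n in (1, 3, 5, 10):
    Z = rng.standard_normal((n, k))
    r = np.linalg.matrix_rank(Z.T @ Z)
    print(f"n={n}: rank(Z'Z)={r}, singular={r < k}")
```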

hejseb