In Andrew Ng's CS229 notes, the Gaussian mixture model and its log-likelihood are given as follows:
\begin{eqnarray} z^{(i)} \sim \textrm{Multinomial}(\phi)\\ \phi_j \geq 0\\ \sum_{j=1}^k \phi_j = 1\\ p(z^{(i)}=j)=\phi_j\\ x^{(i)}|z^{(i)}=j \sim \mathcal{N}(\mu_j,\Sigma_j)\\ \mathcal{L}(\phi,\mu,\Sigma)=\sum_{i=1}^m\log p(x^{(i)};\phi,\mu,\Sigma)=\sum_{i=1}^m\log \sum_{z^{(i)}=1}^k p(x^{(i)}|z^{(i)};\mu, \Sigma)p(z^{(i)};\phi) ~~~~ \textrm{(1)} \end{eqnarray}
where $\lbrace x^{(1)},\ldots, x^{(m)}\rbrace$ is the training set and the $z^{(i)}$ are the corresponding latent variables.
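To make sure I am reading (1) correctly, here is a minimal numerical sketch of it that I put together (my own code, assuming `scipy` is available; the function and variable names are mine, not from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, phi, mus, Sigmas):
    """Equation (1): sum_i log sum_j p(x_i | z_i = j; mu, Sigma) p(z_i = j; phi).

    X      : (m, n) training set
    phi    : (k,) mixing proportions (phi_j >= 0, summing to 1)
    mus    : (k, n) component means
    Sigmas : (k, n, n) component covariances
    """
    m, k = X.shape[0], len(phi)
    log_joint = np.empty((m, k))
    for j in range(k):
        # log [ p(x_i | z_i = j) * phi_j ] for every training point i
        log_joint[:, j] = (multivariate_normal.logpdf(X, mus[j], Sigmas[j])
                           + np.log(phi[j]))
    # the inner sum over z_i is done in log space for numerical stability
    return logsumexp(log_joint, axis=1).sum()
```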
The notes say that if we had known the class labels, we could have written the log-likelihood as follows:
\begin{eqnarray} \mathcal{L}(\phi,\mu,\Sigma)=\sum_{i=1}^m\left[\log p(x^{(i)}|z^{(i)};\mu, \Sigma)+\log p(z^{(i)};\phi)\right] ~~~~~~~~~~\textrm{(2)} \end{eqnarray}
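If I understand (2) correctly, the second log term sits inside the sum over $i$. A literal transcription into code (again my own sketch, with labels encoded as $0,\ldots,k-1$) would be:

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, z, phi, mus, Sigmas):
    """Equation (2): sum_i [ log p(x_i | z_i; mu, Sigma) + log p(z_i; phi) ],
    with the labels z_i observed (encoded here as 0, ..., k-1)."""
    ll = 0.0
    for i in range(X.shape[0]):
        j = z[i]
        ll += (multivariate_normal.logpdf(X[i], mus[j], Sigmas[j])
               + np.log(phi[j]))
    return ll
```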
In this case, the parameter estimates are given as follows:
\begin{eqnarray} \phi_j=\frac{1}{m}\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace ~~~~~~~~~\textrm{(3)}\\ \mu_j=\frac{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace x^{(i)}}{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace}~~~~~~~~~\textrm{(4)}\\ \Sigma_j=\frac{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace}~~~~~\textrm{(5)} \end{eqnarray}
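To check my reading of (3), (4) and (5), this is how I would compute them from fully labelled data (my own sketch; the notes index the classes from 1 to $k$, while I use 0 to $k-1$):

```python
import numpy as np

def fit_labelled_gmm(X, z, k):
    """Closed-form MLEs (3), (4), (5) when the labels z_i are observed.

    X : (m, n) training set
    z : (m,) integer labels in {0, ..., k-1}
    """
    m, n = X.shape
    phi = np.empty(k)
    mus = np.empty((k, n))
    Sigmas = np.empty((k, n, n))
    for j in range(k):
        mask = (z == j)                          # the indicator 1{z_i = j}
        phi[j] = mask.mean()                     # equation (3)
        mus[j] = X[mask].mean(axis=0)            # equation (4)
        diff = X[mask] - mus[j]
        Sigmas[j] = diff.T @ diff / mask.sum()   # equation (5)
    return phi, mus, Sigmas
```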
1) Is it okay to call the $z^{(i)}$ class labels and the $\phi_j$ class proportions?
2) As far as I understand, we cannot optimize (1) analytically by taking derivatives and setting them to zero, because of the $\log \sum$ term. Is that correct?
3) How is (2) derived from the model? I think I get the intuition at some level, but I cannot write it out correctly using the indicator function.
4) How did we obtain (3), (4) and (5)? For $\phi_j$ I got the following, which clearly cannot be set to zero to recover (3): \begin{eqnarray} \frac{\partial}{\partial \phi_j}\mathcal{L}(\phi,\mu,\Sigma)=\sum_{i=1}^m 1\lbrace z^{(i)}=j\rbrace \frac{1}{\phi_j} \end{eqnarray}
5) If we don't know the class labels (as in (1)), is this a constrained optimization problem, and do we need to use Lagrange multipliers, given that the $\phi_j$ must sum to 1?
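For what it is worth, this is the numerical sanity check I have in mind for question 5: maximize $\sum_i \log \phi_{z^{(i)}}$ over the simplex with a generic constrained optimizer and compare against the closed form (3). (My own sketch, assuming `scipy.optimize`; nothing here is from the notes.)

```python
import numpy as np
from scipy.optimize import minimize

def check_phi_mle(z, k):
    """Maximize sum_i log phi_{z_i} subject to sum_j phi_j = 1 and phi_j >= 0,
    then compare the numerical optimum with the closed form (3)."""
    counts = np.bincount(z, minlength=k)          # sum_i 1{z_i = j} for each j
    neg_ll = lambda phi: -np.sum(counts * np.log(phi))
    constraints = ({'type': 'eq', 'fun': lambda phi: phi.sum() - 1.0},)
    bounds = [(1e-9, 1.0)] * k                    # keep log(phi) finite
    res = minimize(neg_ll, np.full(k, 1.0 / k),
                   bounds=bounds, constraints=constraints)
    return res.x, counts / len(z)                 # numerical vs closed form (3)
```

If the two agree on toy labels, that would suggest the simplex constraint is exactly what my unconstrained derivative in question 4 is missing.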