In Andrew Ng's CS229 notes, the Gaussian mixture model and its log-likelihood are given as follows:
\begin{eqnarray} z^{(i)} \sim \textrm{Multinomial}(\phi)\\ \phi_j \geq 0\\ \sum_{j=1}^k \phi_j = 1\\ p(z^{(i)}=j)=\phi_j\\ x^{(i)}|z^{(i)}=j \sim \mathcal{N}(\mu_j,\Sigma_j)\\ \mathcal{L}(\phi,\mu,\Sigma)=\sum_{i=1}^m\log p(x^{(i)};\phi,\mu,\Sigma)=\sum_{i=1}^m\log \sum_{z^{(i)}=1}^k p(x^{(i)}|z^{(i)};\mu, \Sigma)p(z^{(i)};\phi) ~~~~ \textrm{(1)} \end{eqnarray}
where $\lbrace x^{(1)},\ldots, x^{(m)}\rbrace$ is the training set and the $z^{(i)}$ are the corresponding latent variables.
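To make sure I am reading (1) correctly, here is a minimal numerical sketch of it that I put together (my own code, assuming `scipy` is available; the function and variable names are mine, not from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, phi, mus, Sigmas):
    """Equation (1): sum_i log sum_j p(x_i | z_i = j; mu, Sigma) p(z_i = j; phi).

    X      : (m, n) training set
    phi    : (k,) mixing proportions (phi_j >= 0, summing to 1)
    mus    : (k, n) component means
    Sigmas : (k, n, n) component covariances
    """
    m, k = X.shape[0], len(phi)
    log_joint = np.empty((m, k))
    for j in range(k):
        # log [ p(x_i | z_i = j) * phi_j ] for every training point i
        log_joint[:, j] = (multivariate_normal.logpdf(X, mus[j], Sigmas[j])
                           + np.log(phi[j]))
    # the inner sum over z_i is done in log space for numerical stability
    return logsumexp(log_joint, axis=1).sum()
```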
The notes say that if we had known the class labels, we could have written the log-likelihood as follows:
\begin{eqnarray} \mathcal{L}(\phi,\mu,\Sigma)=\sum_{i=1}^m\left[\log p(x^{(i)}|z^{(i)};\mu, \Sigma)+\log p(z^{(i)};\phi)\right] ~~~~~~~~~~\textrm{(2)} \end{eqnarray}
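If I understand (2) correctly, the second log term sits inside the sum over $i$. A literal transcription into code (again my own sketch, with labels encoded as $0,\ldots,k-1$) would be:

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_log_likelihood(X, z, phi, mus, Sigmas):
    """Equation (2): sum_i [ log p(x_i | z_i; mu, Sigma) + log p(z_i; phi) ],
    with the labels z_i observed (encoded here as 0, ..., k-1)."""
    ll = 0.0
    for i in range(X.shape[0]):
        j = z[i]
        ll += (multivariate_normal.logpdf(X[i], mus[j], Sigmas[j])
               + np.log(phi[j]))
    return ll
```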
In this case, the parameter estimates are given as follows:
\begin{eqnarray} \phi_j=\frac{1}{m}\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace ~~~~~~~~~\textrm{(3)}\\ \mu_j=\frac{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace x^{(i)}}{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace}~~~~~~~~~\textrm{(4)}\\ \Sigma_j=\frac{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace (x^{(i)}-\mu_j)(x^{(i)}-\mu_j)^T}{\sum_{i=1}^m 1\lbrace {z^{(i)}=j}\rbrace}~~~~~\textrm{(5)} \end{eqnarray}
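To check my reading of (3), (4) and (5), this is how I would compute them from fully labelled data (my own sketch; the notes index the classes from 1 to $k$, while I use 0 to $k-1$):

```python
import numpy as np

def fit_labelled_gmm(X, z, k):
    """Closed-form MLEs (3), (4), (5) when the labels z_i are observed.

    X : (m, n) training set
    z : (m,) integer labels in {0, ..., k-1}
    """
    m, n = X.shape
    phi = np.empty(k)
    mus = np.empty((k, n))
    Sigmas = np.empty((k, n, n))
    for j in range(k):
        mask = (z == j)                          # the indicator 1{z_i = j}
        phi[j] = mask.mean()                     # equation (3)
        mus[j] = X[mask].mean(axis=0)            # equation (4)
        diff = X[mask] - mus[j]
        Sigmas[j] = diff.T @ diff / mask.sum()   # equation (5)
    return phi, mus, Sigmas
```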
1) Is it okay to call the $z^{(i)}$ class labels and the $\phi_j$ class proportions?
2) As far as I understand, we cannot optimize (1) analytically by taking derivatives and setting them to zero, because of the $\log \sum$ term. Is that correct?
3) How is (2) derived from the model? I think I get the intuition at some level, but I cannot write it out correctly using the indicator function.
4) How did we obtain (3), (4) and (5)? For $\phi_j$ I got the following, which clearly cannot be set to zero to recover (3): \begin{eqnarray} \frac{\partial}{\partial \phi_j}\mathcal{L}(\phi,\mu,\Sigma)=\sum_{i=1}^m 1\lbrace z^{(i)}=j\rbrace \frac{1}{\phi_j} \end{eqnarray}
5) If we don't know the class labels (as in (1)), is this a constrained optimization problem, and do we need to use Lagrange multipliers, given that the $\phi_j$ must sum to 1?
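For what it is worth, this is the numerical sanity check I have in mind for question 5: maximize $\sum_i \log \phi_{z^{(i)}}$ over the simplex with a generic constrained optimizer and compare against the closed form (3). (My own sketch, assuming `scipy.optimize`; nothing here is from the notes.)

```python
import numpy as np
from scipy.optimize import minimize

def check_phi_mle(z, k):
    """Maximize sum_i log phi_{z_i} subject to sum_j phi_j = 1 and phi_j >= 0,
    then compare the numerical optimum with the closed form (3)."""
    counts = np.bincount(z, minlength=k)          # sum_i 1{z_i = j} for each j
    neg_ll = lambda phi: -np.sum(counts * np.log(phi))
    constraints = ({'type': 'eq', 'fun': lambda phi: phi.sum() - 1.0},)
    bounds = [(1e-9, 1.0)] * k                    # keep log(phi) finite
    res = minimize(neg_ll, np.full(k, 1.0 / k),
                   bounds=bounds, constraints=constraints)
    return res.x, counts / len(z)                 # numerical vs closed form (3)
```

If the two agree on toy labels, that would suggest the simplex constraint is exactly what my unconstrained derivative in question 4 is missing.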