
I am learning the EM algorithm. As I understand it, the steps are as follows:

1. Initialise the parameters.

2. Start with the E-step.

3. M-step: maximize the expected log-likelihood function (from the E-step) and find the new estimates of the parameters.

4. Iterate until the difference between the last and the new log-likelihood is small (a minimal sketch of these steps follows below).
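For concreteness, here is a minimal R sketch of these four steps for a two-component Gaussian mixture. It is not the code of the linked question; the data `dat` and all starting values are made up for illustration.

    # toy data: an assumed stand-in for the `dat` of the linked question
    set.seed(1)
    dat <- c(rnorm(100, 0, 1), rnorm(100, 4, 1.5))

    # 1) initialise the parameters
    pi1 <- 0.5; pi2 <- 0.5
    mu1 <- min(dat); mu2 <- max(dat)
    sigma1 <- sd(dat); sigma2 <- sd(dat)

    loglik_old <- -Inf
    repeat {
      # 2) E-step: responsibilities (posterior probability of each component)
      d1 <- pi1 * dnorm(dat, mu1, sigma1)
      d2 <- pi2 * dnorm(dat, mu2, sigma2)
      tau1 <- d1 / (d1 + d2)
      tau2 <- 1 - tau1

      # 3) M-step: new parameter estimates from the weighted data
      pi1 <- mean(tau1); pi2 <- mean(tau2)
      mu1 <- sum(tau1 * dat) / sum(tau1)
      mu2 <- sum(tau2 * dat) / sum(tau2)
      sigma1 <- sqrt(sum(tau1 * (dat - mu1)^2) / sum(tau1))
      sigma2 <- sqrt(sum(tau2 * (dat - mu2)^2) / sum(tau2))

      # 4) iterate until the observed-data log-likelihood stops changing
      loglik_new <- sum(log(pi1 * dnorm(dat, mu1, sigma1) +
                            pi2 * dnorm(dat, mu2, sigma2)))
      if (abs(loglik_new - loglik_old) < 1e-8) break
      loglik_old <- loglik_new
    }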

My question stems from the *em in r* question on this site. I do not understand why, at the very start, they write the log-likelihood as:

loglik[2]<-mysum(pi1*(log(pi1)+log(dnorm(dat,mu1,sigma1))))+mysum(pi2*(log(pi2)+log(dnorm(dat,mu2,sigma2))))

Why do they multiply by `pi1` twice, as in `pi1*(log(pi1) + ...)`?

Alice
  • I do not understand your question: in the connected X validated question, the code in the answer for the generic M step is `mysum(tau1*(log(pi1)+logdnorm(x,mu1,sigma1)))+mysum(tau2*(log(pi2)+logdnorm(x,mu2,sigma2)))`. The one you produce is only at the initialisation step. – Xi'an Nov 02 '17 at 11:50
  • Even for the initialisation step, it should not multiply by the weights twice, as I understand it. – Alice Nov 02 '17 at 12:04
  • In this video and other papers, including the one in the X validated question, the convergence is based on the log-likelihood function, not on the complete one. – Alice Nov 02 '17 at 12:06
  • @Xi'an As I understand it, we first compute the log-likelihood at the initial values. Then we do the E-step and the M-step. After the M-step we can compute the log-likelihood for the new parameters and iterate until convergence. But they base the convergence on the complete log-likelihood. – Alice Nov 02 '17 at 12:09
  • (i) The initialisation may be at the E or at the M step, it does not make a difference. (ii) The E step involves the expectation of the _complete log-likelihood_ and the M step maximises the function obtained in the E step (a formula sketch of these two steps follows below). – Xi'an Nov 02 '17 at 12:20
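For reference, these two steps can be written out as follows for a two-component mixture (notation mine, not taken from the linked question). The E-step computes the responsibilities

$$\tau_{i,j} = \frac{\pi_i\, f(x_j;\mu_i,\sigma_i)}{\pi_1\, f(x_j;\mu_1,\sigma_1) + \pi_2\, f(x_j;\mu_2,\sigma_2)}, \qquad i = 1,2,$$

and the M-step maximises the expected complete log-likelihood

$$Q(\theta \mid \theta^{\text{old}}) = \sum_{j} \sum_{i=1}^{2} \tau_{i,j}\,\bigl[\log \pi_i + \log f(x_j;\mu_i,\sigma_i)\bigr]$$

over $\theta = (\pi_1, \pi_2, \mu_1, \mu_2, \sigma_1, \sigma_2)$, with the $\tau_{i,j}$ held fixed at their E-step values.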

1 Answer


I agree with you: this is not the actual formula of the log-likelihood. What they should have implemented is

$$ \sum_{j} z_{1,j} [\log(\pi_1) + \log(f(x_j;\Psi^\text{old}_1))] + \sum_{j} z_{2,j} [\log(\pi_2) + \log(f(x_j;\Psi^\text{old}_2))]$$

where $j$ runs over the training-sample indices. What they have implemented instead seems to be

$$ \sum_{j} \pi_1 [\log(\pi_1) + \log(f(x_j;\Psi^\text{old}_1))] + \sum_{j} \pi_2 [\log(\pi_2) + \log(f(x_j;\Psi^\text{old}_2))]$$

HOWEVER, this does not make much of a difference: in more formal language, the $z_{i,j}$ are the conditional densities/probabilities $f_{Z_i|X_j}(i|x_j)$, i.e. something interpretable as the probability that $x_j$ ''comes from'' the $i$-th density in the mixture. Now notice that they only do this weird 'wrong' multiplication at the very beginning... in the loop they seem to have implemented the exact formula above.

Phrased differently: if you have no prior knowledge and you have two components, what is the initial probability that some random training sample comes from the first density? You have to specify some value in order to make the next step and improve your prior belief. So, what value do you set initially? Well, since you do not want to put in any prior knowledge, it is a good idea to set $z_{i,j} = 1/\text{number of components}$, i.e. $z_{i,j} = 1/2$ in the case of two densities. By 'accident', this is exactly the initial value that they have assigned to `pi1` (well, up to the minus sign... this is merely a question of whether you want to maximize $\log L$ or minimize $-\log L$). Hence, they should have written

loglik[2]<-mysum( -(1/2) * (log(pi1) + ...

but the author has decided to write

loglik[2]<-mysum( pi1 * (log(pi1) + ...

which is confusing (because philosophically it is absolutely wrong, the `pi1` does not have to go in there) but in the end, the evaluated value will be exactly the same...
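To see that the evaluated values really do coincide at this particular initialisation, here is a quick R check. It is only a sketch: `mysum` is an assumed stand-in for the helper of the linked code (summing the finite entries), and `dat` and the starting parameters are made up.

    # assumed stand-in for the helper used in the linked code
    mysum <- function(x) sum(x[is.finite(x)])

    set.seed(1)
    dat <- c(rnorm(50, 0, 1), rnorm(50, 4, 1.5))   # toy data

    pi1 <- pi2 <- 0.5                              # initial mixture weights
    z1  <- z2  <- rep(0.5, length(dat))            # 1 / number of components
    mu1 <- 0; sigma1 <- 1; mu2 <- 4; sigma2 <- 1.5 # arbitrary starting parameters

    # intended formula: weighted by the responsibilities z_{i,j}
    with_z  <- mysum(z1 * (log(pi1) + log(dnorm(dat, mu1, sigma1)))) +
               mysum(z2 * (log(pi2) + log(dnorm(dat, mu2, sigma2))))

    # the author's version: weighted by the mixture proportions pi_i
    with_pi <- mysum(pi1 * (log(pi1) + log(dnorm(dat, mu1, sigma1)))) +
               mysum(pi2 * (log(pi2) + log(dnorm(dat, mu2, sigma2))))

    all.equal(with_z, with_pi)   # TRUE: identical at this particular initialisation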

Does that make sense?

Fabian Werner
  • I just wonder whether we converge based on the incomplete log-likelihood or the complete log-likelihood. To my knowledge the convergence is based on the incomplete log-likelihood, so your first equation is used only for the E and M steps. – Alice Nov 02 '17 at 10:56
  • Also, does EM only work for missing data? – Alice Nov 02 '17 at 10:57
  • Depends on what you mean by 'missing'... EM is a general strategy to uncover latent (unobserved) values of additional random variables. For example: in the mixture case, the random variable that tells you which density each training sample belongs to can be considered as an additional column in your data set with all NA values... EM fills in these values for you in a sensible way... – Fabian Werner Nov 02 '17 at 11:02
  • Thanks so much. But if my data is complete and there are no latent variables, can EM still be used? – Alice Nov 02 '17 at 11:03
  • If you have some prior knowledge like "training samples 1, 3 and 77 do with absolute certainty come from density 1", then you can just set the respective values in the likelihood to a constant 1 or 0 and then run the EM algorithm to fill in the rest of the column. – Fabian Werner Nov 02 '17 at 11:04
  • Well, if your data is completely visible to you then what do you want to do with it? :-) – Fabian Werner Nov 02 '17 at 11:04
  • amazing! your help is appreciated. – Alice Nov 02 '17 at 11:04
  • I want to estimate my model parameters – Alice Nov 02 '17 at 11:05
  • For example, if I have mixture data and would like to fit a Gaussian model, then I would like to estimate the parameters and know the weight of each component. – Alice Nov 02 '17 at 11:06
  • One last note about EM: what EM does is *not* directly maximizing some likelihood. Indeed, it maximizes some (at first glance completely unrelated) quantity $Q$, but it turns out that one can prove the following: "if we increase $Q$ then we also increase the observed likelihood". That is in fact the reason why EM works (a short sketch of this appears after these comments). – Fabian Werner Nov 02 '17 at 11:06
  • Ok. It is clear now. – Alice Nov 02 '17 at 11:07
  • But then your data *has* some latent variables, namely: from which density does each training sample come? If you know exactly that training samples 1, 2, 3 come from density 1 and 4, 5, 6 come from density 2, then you can just write down the complete likelihood and maximize it in order to get the parameters for each single density. EM does both: it uncovers the relation between densities and training samples *and* obtains sensible parameters for every single density. EM gives you the $\mu_i, \Sigma_i$ for the densities and also the $\pi_i$. – Fabian Werner Nov 02 '17 at 11:08
  • Oh. So latent variables means that I do not know where the point comes from. I am a mathematician but doing some stats now. – Alice Nov 02 '17 at 11:10
  • You deserve 5 stars for your great help. – Alice Nov 02 '17 at 11:10
  • No worries, always glad to clarify things :-) – Fabian Werner Nov 02 '17 at 12:22
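For reference, a short sketch of the point about $Q$ made in the comments above (standard material, not part of the original thread). Writing $Q(\theta \mid \theta^{\text{old}}) = \mathbb{E}_{Z \mid X,\, \theta^{\text{old}}}\bigl[\log p(X, Z \mid \theta)\bigr]$, the observed log-likelihood decomposes as

$$\log L(\theta) \;=\; Q(\theta \mid \theta^{\text{old}}) \;-\; \mathbb{E}_{Z \mid X,\, \theta^{\text{old}}}\bigl[\log p(Z \mid X, \theta)\bigr],$$

and by Gibbs' inequality the second term on the right is minimised at $\theta = \theta^{\text{old}}$. Hence, for the $\theta^{\text{new}}$ chosen by the M-step to maximise $Q(\cdot \mid \theta^{\text{old}})$,

$$\log L(\theta^{\text{new}}) - \log L(\theta^{\text{old}}) \;\ge\; Q(\theta^{\text{new}} \mid \theta^{\text{old}}) - Q(\theta^{\text{old}} \mid \theta^{\text{old}}) \;\ge\; 0,$$

so the observed (incomplete) log-likelihood can never decrease from one iteration to the next, which is why it is the natural quantity to monitor for convergence.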